Different tanimoto scores using compr

User ed9697d993

15-12-2009 11:09:15

Hello,


I am using JChem version 5.2.1_1 and have run into the following problems:


         -I convert a set of molecules in SMILES format (which we will call set 1) to fingerprints using generatemd. Some of the molecules have the same SMILES, and they are converted to the same fingerprints (so far so good). But when I compare this set to a single molecule (using compr), the Tanimoto scores for the two identical molecules differ! The difference is small (0.4018 and 0.4042), but the scores are not the same. If I swap the positions of the two molecules, the scores swap as well (the first molecule in the list always gets the higher score).


         -When I merge the set of molecules (set 1) with another set of molecules (set 2) and rerun the comparison, I obtain different scores for set 1 than in the first comparison (even though the fingerprints are still the same!).


I searched the compr documentation: is the heuristic method of compr responsible for this strange behavior?


Thanks in advance for your help.


Florent.

ChemAxon efa1591b5a

17-12-2009 10:45:40

Hi Florent,


Oops, that sounds rather odd. I will investigate and try to reproduce.


Compr does not rely on any improper heuristics, so its results should be consistent and deterministic. Even if compr used heuristics, the Tanimoto score for two equal fingerprints would have to be the same. So this is clearly a bug.
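
To make the point about equal fingerprints concrete, here is a minimal sketch of the Tanimoto dissimilarity on bit-string fingerprints. It is not compr's implementation; the function name and the 8-bit toy fingerprints are assumptions chosen for illustration only.

```python
def tanimoto_dissimilarity(a: int, b: int) -> float:
    """Return 1 - |A and B| / |A or B| for fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    common = bin(a & b).count("1")
    return 1.0 - common / union


# Hypothetical 8-bit fingerprints (real chemical fingerprints are much longer).
query = 0b10110110
ref_a = 0b10110011
ref_b = 0b10110011  # same bits as ref_a, i.e. the "same molecule" listed twice

d_a = tanimoto_dissimilarity(query, ref_a)
d_b = tanimoto_dissimilarity(query, ref_b)
print(d_a, d_b)
```

Because the score depends only on the bits of the two fingerprints, identical inputs cannot yield different scores. A difference therefore suggests that the bits actually compared are not the ones that were expected.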


I will also check the upcoming release (5.3).


In the meantime, you may wish to try screenmd if you are comparing against a single molecule or a small set of molecules. Screenmd is intended for a somewhat different purpose, virtual screening, but if you set the dissimilarity threshold to 1, no filtering takes place. Please get back to the forum if you need assistance with the parameter settings. Screenmd will produce the complete dissimilarity matrix. Will that help?


I will let you know my findings.


Kind regards


Miklos


 

ChemAxon efa1591b5a

18-12-2009 10:00:15

Hi again,


I tried to reproduce the faulty behaviour you described in your previous post, but I could not. Compr behaved consistently.


Could you perhaps send us the two molecules (or the entire input set, if it is not confidential) along with the exact commands you executed? That would be a great help for us.


Many thanks and apologies for the inconvenience this problem caused.


 


Regards


Miklos

User ed9697d993

18-12-2009 13:28:55

Hi Miklos,


Thank you for taking the time to investigate the problem.


Here is an example of input, commands and output:


My query molecule is "C1CN(C(=O)NC1=O)C2C(C(C(O2)COP(=O)(O)O)O)O udp1", which is stored in the file udp1.ism.
My reference set is this one:


C1C(=O)NC(=O)N(C1=O)C2C(C(C(O2)COP(=O)(O)O)O)O DB03668
C(CC(C(=O)O)N(CO)O)CN1C(=O)C2(C(C(C(O2)COP(=O)(O)O)O)O)NC1=O DB04460
C(CC(C(=O)O)N(CO)O)CN1C(=O)C2(C(C(C(O2)COP(=O)(O)O)O)O)NC1=O DB02666
C1CN(C(=O)NC1=O)C2CC(C(O2)CO)O DB03562
C(C1C(C(C2(O1)C(=O)NC(=O)N2)O)O)OP(=O)(O)O DB02493
C(C1C(C(C2(O1)C(=O)NC(=O)N2)O)O)OP(=O)(O)O DB02150
C(C1C(C(C(O1)NC(=O)CN)O)O)OP(=O)([O-])[O-] DB02236
C1C(C(OC1N2C=CC(NC2=O)O)COP(=O)(O)O)O DB04280
C1[NH2+]C(=C(N1C2C(C(C(O2)COP(=O)([O-])[O-])O)O)[O-])C(=O)N DB01945
CN1C(=O)C2(C(C(C(C(O2)CO)O)O)O)NC1=O DB03479
C1CN(C(=O)NC1=O)[C@H]2[C@@H]([C@@H]([C@H](O2)COP(=O)(O)O)O)O H2U
C1C(=O)NC(=O)N(C1=O)[C@H]2[C@@H]([C@@H]([C@H](O2)COP(=O)(O)O)O)O BMQ
C[C@]1(CN(C(=O)NC1=O)[C@H]2C[C@@H]([C@H](O2)COP(=O)(O)O)O)O 64T
C[C@@]1(CN(C(=O)NC1=O)[C@H]2[C@@H]([C@@H]([C@H](O2)COP(=O)(O)O)O)O)F FMU
C[C@]1([C@@H](N(C(=O)NC1=O)[C@H]2C[C@@H]([C@H](O2)COP(=O)(O)O)O)O)O CTG
C1[C@@H](N(C(=O)NC1=O)[C@H]2[C@@H]([C@@H]([C@H](O2)COP(=O)(O)O)O)O)C(=O)O 2OM
C[C@@H]1CN(C(=O)NC1=O)[C@H]2C[C@@H]([C@H](O2)COP(=O)(O)O)O PBT
C[C@]1(CN(C(=O)NC1=O)[C@H]2C[C@@H]([C@H](O2)COP(=O)(O)O)O)N 64P
C1[C@@H]([C@H](O[C@H]1N2CC(=[Te])C(=O)NC2=O)COP(=O)(O)O)O TTI
C[C@@H]1CN2[C@H]1N(C2=O)[C@H]3C[C@@H]([C@H](O3)COP(=O)(O)O)O TA3


This set contains molecules from DrugBank and the PDB and is stored in the file best_udp1.ism.


I first generate the fingerprints like this:


               -My first command is this one: sh generatemd c best_udp1.ism  -k CF -o refset.pf


               -Then I type: sh generatemd c udp1.ism  -k CF -o udp1.pf


I delete the first line in each file (configuration parameters, which make compr stop).


I then run the comparison:


compr -f 1024 -t 0.4 -g -z -L -i udp1.pf refset.pf -o results.txt


And I obtain the following results:


id    minD    nneib    simcnt    avgD    maxD    list_of_similar_objects ...
1    0,0160    1    1    0,0160    0,0160    1
2    0,2447    1    1    0,2447    0,2447    1
3    0,2430    1    1    0,2430    0,2430    1
4    0,2449    1    1    0,2449    0,2449    1
5    0,2629    1    1    0,2629    0,2629    1
6    0,2660    1    1    0,2660    0,2660    1
7    0,2829    1    1    0,2829    0,2829    1
8    0,3390    1    1    0,3390    0,3390    1
9    0,3611    1    1    0,3611    0,3611    1
10    0,3808    1    1    0,3808    0,3808    1
11    0,0040    1    1    0,0040    0,0040    1
12    0,0219    1    1    0,0219    0,0219    1
13    0,0763    1    1    0,0763    0,0763    1
14    0,0804    1    1    0,0804    0,0804    1
15    0,0918    1    1    0,0918    0,0918    1
16    0,1022    1    1    0,1022    0,1022    1
17    0,0962    1    1    0,0962    0,0962    1
18    0,1341    1    1    0,1341    0,1341    1
19    0,1458    1    1    0,1458    0,1458    1
20    0,1647    1    1    0,1647    0,1647    1

STATISTICS

Number of objects in set 1 = 1
Number of objects in set 2 = 20
Minimum dissimilarity between sets = 0.00404042
Average dissimilarity between sets = 0.17792888
Maximum dissimilarity between sets = 0.38078904


The problem is the following:


For


C(CC(C(=O)O)N(CO)O)CN1C(=O)C2(C(C(C(O2)COP(=O)(O)O)O)O)NC1=O DB04460
C(CC(C(=O)O)N(CO)O)CN1C(=O)C2(C(C(C(O2)COP(=O)(O)O)O)O)NC1=O DB02666


I obtain:


2    0,2447    1    1    0,2447    0,2447    1
3    0,2430    1    1    0,2430    0,2430    1


Same molecules, almost the same scores... but there is still a difference.
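
A check like this can be automated. The sketch below pairs rows 2 to 6 of the output above with their SMILES and reports duplicated SMILES whose minD values differ. The helper name is hypothetical, and duplicates are detected by exact string match rather than by canonicalizing the structures, which is an assumption that happens to hold for this data.

```python
from collections import defaultdict

# (SMILES, name, minD as printed by compr) for rows 2-6 of the output above.
rows = [
    ("C(CC(C(=O)O)N(CO)O)CN1C(=O)C2(C(C(C(O2)COP(=O)(O)O)O)O)NC1=O", "DB04460", "0,2447"),
    ("C(CC(C(=O)O)N(CO)O)CN1C(=O)C2(C(C(C(O2)COP(=O)(O)O)O)O)NC1=O", "DB02666", "0,2430"),
    ("C1CN(C(=O)NC1=O)C2CC(C(O2)CO)O", "DB03562", "0,2449"),
    ("C(C1C(C(C2(O1)C(=O)NC(=O)N2)O)O)OP(=O)(O)O", "DB02493", "0,2629"),
    ("C(C1C(C(C2(O1)C(=O)NC(=O)N2)O)O)OP(=O)(O)O", "DB02150", "0,2660"),
]


def inconsistent_duplicates(rows):
    """Group rows by SMILES text and keep groups whose scores are not all equal."""
    by_smiles = defaultdict(list)
    for smiles, name, score in rows:
        # The output uses a decimal comma, so convert it before parsing.
        by_smiles[smiles].append((name, float(score.replace(",", "."))))
    return [
        entries
        for entries in by_smiles.values()
        if len({score for _, score in entries}) > 1
    ]


for group in inconsistent_duplicates(rows):
    print(group)
```

On this data the check flags two pairs: DB04460/DB02666 and DB02493/DB02150.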


Am I doing something wrong? Thank you again for your answers!


Florent.


P.S.: I attach the fingerprint files, which have a .txt extension here (because of the attachment requirements) but are in fact called udp1.pf and refset1.pf.

ChemAxon efa1591b5a

06-01-2010 12:08:41

Hi Florent,


Thank you for the very detailed description of the problem. I understand it, and at first glance your observation appears to be correct: there should be no difference between the two scores, and you did everything correctly. I need to investigate further; I will do that later today and get back to you soon.


Thank you for your patience.


Kind regards


Miklos

User ed9697d993

15-01-2010 08:48:25

Hi Miklos, 


 


First of all, happy new year! (not too late, I hope)


So, did you manage to reproduce these results?

Florent.

ChemAxon efa1591b5a

27-01-2010 13:49:28

Hi Florent,


I managed to take a look at the details of your post and played with your files and commands.


I had overlooked something that might seem like a minor detail, yet it is very important. You wrote:


"I delete the first line in each file (configuration parameters, which make compr stop)."

There is more in the file than the configuration parameters: the output of generatemd also stores a unique identifier in the first field of each line, one per fingerprint. Compr does not ignore this identifier when it reads the input; it treats it as the first 32 bits of the fingerprint (the fingerprint is stored in decimal format in this fingerprint file).


Unfortunately, compr is not robust enough to warn that some values (one at the end of each line in your case) were not processed while the input file was read (in fact, they were silently ignored). This introduces 'noise' into your fingerprints: because the id of your 3rd molecule is one larger than that of the previous one, the two fingerprints differ by 1 bit, and this difference is reflected in the similarity scores (they are close but not the same, as you pointed out).
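
The effect can be reproduced with a toy model. The sketch below is not compr's reader; it only mimics the behaviour described above by putting a row id in front of the fingerprint words and dropping the last value so that the length stays fixed. The 4-word fingerprints, the ids 1, 2 and 3, and all function names are assumptions made for illustration.

```python
WORD_BITS = 32


def to_bits(words):
    """Concatenate 32-bit words into a single Python int."""
    value = 0
    for w in words:
        value = (value << WORD_BITS) | (w & 0xFFFFFFFF)
    return value


def tanimoto_dissimilarity(a_words, b_words):
    a, b = to_bits(a_words), to_bits(b_words)
    union = bin(a | b).count("1")
    return 0.0 if union == 0 else 1.0 - bin(a & b).count("1") / union


def misread(row_id, words):
    """Model of the faulty read: the id becomes word 0, the last value is dropped."""
    return ([row_id] + words)[: len(words)]


# Hypothetical fingerprints; the same reference is listed twice (rows 2 and 3).
query = [0x0F0F, 0x00FF, 0x3C3C, 0x1]
ref = [0x0F0F, 0x0FF0, 0x3C3C, 0x1]

clean_2 = tanimoto_dissimilarity(query, ref)
clean_3 = tanimoto_dissimilarity(query, ref)
noisy_2 = tanimoto_dissimilarity(misread(1, query), misread(2, ref))
noisy_3 = tanimoto_dissimilarity(misread(1, query), misread(3, ref))

print(clean_2, clean_3)  # identical scores when only fingerprint words are read
print(noisy_2, noisy_3)  # scores differ once the row ids leak into the bits
```

In this model the two clean scores are equal, while the two misread scores differ slightly, which matches the pattern of close but unequal values in the original results.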


Is the explanation above clear?


The right way to produce a fingerprint file that is directly suitable for further processing by compr (or other command-line tools in the clustering package) is to add the -D (decimal output) option to the generatemd command line. Your command to generate the chemical fingerprints should look like this:


    generatemd c best_udp1.ism  -k CF -o refset.pf -D

With -D, the unique id is not placed in the first field of each row; the rows contain only fingerprint values. Apart from that, the header line (which you had to delete manually) is also omitted by -D.


With this you get a fingerprint file that can be fed directly into compr. You should get output like this:


id      minD    nneib   simcnt  avgD    maxD    list_of_similar_objects ...
1 0.0176 1 1 0.0176 0.0176 1
2 0.2480 1 1 0.2480 0.2480 1
3 0.2480 1 1 0.2480 0.2480 1
4 0.2430 1 1 0.2430 0.2430 1
5 0.2675 1 1 0.2675 0.2675 1
6 0.2675 1 1 0.2675 0.2675 1
7 0.2857 1 1 0.2857 0.2857 1
8 0.3400 1 1 0.3400 0.3400 1
9 0.3636 1 1 0.3636 0.3636 1
10 0.3834 1 1 0.3834 0.3834 1
11 0.0000 1 1 0.0000 0.0000 1
12 0.0176 1 1 0.0176 0.0176 1
13 0.0732 1 1 0.0732 0.0732 1
14 0.0789 1 1 0.0789 0.0789 1
15 0.0868 1 1 0.0868 0.0868 1
16 0.1004 1 1 0.1004 0.1004 1
17 0.0927 1 1 0.0927 0.0927 1
18 0.1283 1 1 0.1283 0.1283 1
19 0.1402 1 1 0.1402 0.1402 1
20 0.1605 1 1 0.1605 0.1605 1

STATISTICS

Number of objects in set 1 = 1
Number of objects in set 2 = 20
Minimum dissimilarity between sets = 0.0

Is this the result you expected?


Apologies that it took us somewhat longer than normal to resolve this problem. I hope it still helps.


Regards,


Miklos

User ed9697d993

28-01-2010 12:54:17

Hi Miklos !


I understood the explanation, thanks for explaining the ins and outs of the problem.
It works correctly now.
Thank you for your support !


Florent.

ChemAxon efa1591b5a

28-01-2010 16:46:30

Hi Florent,


I'm glad it finally works.


If you can spare some time, I'd be happy to learn more about the problem you are working on, in particular how well compr meets your requirements, or, to put it more provocatively: what would be 'the ultimate tool' that would best help you in your work?


Kind regards,


Miklos