Hi Miklos,
Thank you for taking the time to investigate the problem.
Here is an example of input, commands and output:
My query molecule is the following: "C1CN(C(=O)NC1=O)C2C(C(C(O2)COP(=O)(O)O)O)O udp1" which is in the file udp1.ism.
My reference set is this one:
C1C(=O)NC(=O)N(C1=O)C2C(C(C(O2)COP(=O)(O)O)O)O DB03668
C(CC(C(=O)O)N(CO)O)CN1C(=O)C2(C(C(C(O2)COP(=O)(O)O)O)O)NC1=O DB04460
C(CC(C(=O)O)N(CO)O)CN1C(=O)C2(C(C(C(O2)COP(=O)(O)O)O)O)NC1=O DB02666
C1CN(C(=O)NC1=O)C2CC(C(O2)CO)O DB03562
C(C1C(C(C2(O1)C(=O)NC(=O)N2)O)O)OP(=O)(O)O DB02493
C(C1C(C(C2(O1)C(=O)NC(=O)N2)O)O)OP(=O)(O)O DB02150
C(C1C(C(C(O1)NC(=O)CN)O)O)OP(=O)([O-])[O-] DB02236
C1C(C(OC1N2C=CC(NC2=O)O)COP(=O)(O)O)O DB04280
C1[NH2+]C(=C(N1C2C(C(C(O2)COP(=O)([O-])[O-])O)O)[O-])C(=O)N DB01945
CN1C(=O)C2(C(C(C(C(O2)CO)O)O)O)NC1=O DB03479
C1CN(C(=O)NC1=O)[C@H]2[C@@H]([C@@H]([C@H](O2)COP(=O)(O)O)O)O H2U
C1C(=O)NC(=O)N(C1=O)[C@H]2[C@@H]([C@@H]([C@H](O2)COP(=O)(O)O)O)O BMQ
C[C@]1(CN(C(=O)NC1=O)[C@H]2C[C@@H]([C@H](O2)COP(=O)(O)O)O)O 64T
C[C@@]1(CN(C(=O)NC1=O)[C@H]2[C@@H]([C@@H]([C@H](O2)COP(=O)(O)O)O)O)F FMU
C[C@]1([C@@H](N(C(=O)NC1=O)[C@H]2C[C@@H]([C@H](O2)COP(=O)(O)O)O)O)O CTG
C1[C@@H](N(C(=O)NC1=O)[C@H]2[C@@H]([C@@H]([C@H](O2)COP(=O)(O)O)O)O)C(=O)O 2OM
C[C@@H]1CN(C(=O)NC1=O)[C@H]2C[C@@H]([C@H](O2)COP(=O)(O)O)O PBT
C[C@]1(CN(C(=O)NC1=O)[C@H]2C[C@@H]([C@H](O2)COP(=O)(O)O)O)N 64P
C1[C@@H]([C@H](O[C@H]1N2CC(=[Te])C(=O)NC2=O)COP(=O)(O)O)O TTI
C[C@@H]1CN2[C@H]1N(C2=O)[C@H]3C[C@@H]([C@H](O3)COP(=O)(O)O)O TA3
It contains molecules from DrugBank and the PDB and is named best_udp1.ism.
I first generate the fingerprints like this:
- My first command is: sh generatemd c best_udp1.ism -k CF -o refset.pf
- Then I type: sh generatemd c udp1.ism -k CF -o udp1.pf
I then delete the first line of each file (it holds configuration parameters, which make compr stop).
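The header removal can also be scripted instead of done by hand. This is only a minimal sketch, assuming the configuration header is always exactly the first line of the .pf file; the function names and the output file name are hypothetical, not part of generatemd or compr:

```python
from pathlib import Path


def strip_header(text):
    """Return text without its first line (assumed to be the config header)."""
    lines = text.splitlines(keepends=True)
    return "".join(lines[1:])


def strip_header_file(src, dst):
    """Write a copy of src to dst with the first line removed."""
    Path(dst).write_text(strip_header(Path(src).read_text()))


# Example call (the output name is hypothetical):
# strip_header_file("refset.pf", "refset_noheader.pf")
```

Writing to a new file rather than overwriting keeps the original generatemd output available for comparison.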
I then run the comparison:
compr -f 1024 -t 0.4 -g -z -L -i udp1.pf refset.pf -o results.txt
And I obtain the following results:
id minD nneib simcnt avgD maxD list_of_similar_objects ...
1 0,0160 1 1 0,0160 0,0160 1
2 0,2447 1 1 0,2447 0,2447 1
3 0,2430 1 1 0,2430 0,2430 1
4 0,2449 1 1 0,2449 0,2449 1
5 0,2629 1 1 0,2629 0,2629 1
6 0,2660 1 1 0,2660 0,2660 1
7 0,2829 1 1 0,2829 0,2829 1
8 0,3390 1 1 0,3390 0,3390 1
9 0,3611 1 1 0,3611 0,3611 1
10 0,3808 1 1 0,3808 0,3808 1
11 0,0040 1 1 0,0040 0,0040 1
12 0,0219 1 1 0,0219 0,0219 1
13 0,0763 1 1 0,0763 0,0763 1
14 0,0804 1 1 0,0804 0,0804 1
15 0,0918 1 1 0,0918 0,0918 1
16 0,1022 1 1 0,1022 0,1022 1
17 0,0962 1 1 0,0962 0,0962 1
18 0,1341 1 1 0,1341 0,1341 1
19 0,1458 1 1 0,1458 0,1458 1
20 0,1647 1 1 0,1647 0,1647 1
STATISTICS
Number of objects in set 1 = 1
Number of objects in set 2 = 20
Minimum dissimilarity between sets = 0.00404042
Average dissimilarity between sets = 0.17792888
Maximum dissimilarity between sets = 0.38078904
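As a sanity check, the per-row minD values can be compared with the STATISTICS block. The sketch below assumes the table layout shown above (row id first, minD second, decimal commas); parse_rows is a hypothetical helper, not something compr provides:

```python
def parse_rows(report):
    """Extract (id, minD) pairs from the table part of the report.

    Decimal commas ("0,2447") are converted to decimal points.
    """
    rows = []
    for line in report.splitlines():
        fields = line.split()
        # Data rows start with an integer id; header and statistics lines do not.
        if len(fields) >= 2 and fields[0].isdigit():
            rows.append((int(fields[0]), float(fields[1].replace(",", "."))))
    return rows


# Table rows copied from results.txt above.
report = """\
id minD nneib simcnt avgD maxD list_of_similar_objects ...
1 0,0160 1 1 0,0160 0,0160 1
2 0,2447 1 1 0,2447 0,2447 1
3 0,2430 1 1 0,2430 0,2430 1
4 0,2449 1 1 0,2449 0,2449 1
5 0,2629 1 1 0,2629 0,2629 1
6 0,2660 1 1 0,2660 0,2660 1
7 0,2829 1 1 0,2829 0,2829 1
8 0,3390 1 1 0,3390 0,3390 1
9 0,3611 1 1 0,3611 0,3611 1
10 0,3808 1 1 0,3808 0,3808 1
11 0,0040 1 1 0,0040 0,0040 1
12 0,0219 1 1 0,0219 0,0219 1
13 0,0763 1 1 0,0763 0,0763 1
14 0,0804 1 1 0,0804 0,0804 1
15 0,0918 1 1 0,0918 0,0918 1
16 0,1022 1 1 0,1022 0,1022 1
17 0,0962 1 1 0,0962 0,0962 1
18 0,1341 1 1 0,1341 0,1341 1
19 0,1458 1 1 0,1458 0,1458 1
20 0,1647 1 1 0,1647 0,1647 1
"""

values = [min_d for _, min_d in parse_rows(report)]
mean = sum(values) / len(values)
print(f"min={min(values):.4f} mean={mean:.4f} max={max(values):.4f}")
```

The minimum, mean and maximum computed this way come out at about 0.0040, 0.1779 and 0.3808, which agrees with the STATISTICS block up to the four-decimal rounding of the table.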
The problem is the following. For these two entries:
C(CC(C(=O)O)N(CO)O)CN1C(=O)C2(C(C(C(O2)COP(=O)(O)O)O)O)NC1=O DB04460
C(CC(C(=O)O)N(CO)O)CN1C(=O)C2(C(C(C(O2)COP(=O)(O)O)O)O)NC1=O DB02666
I obtain:
2 0,2447 1 1 0,2447 0,2447 1
3 0,2430 1 1 0,2430 0,2430 1
These are the same molecule and the scores are almost the same... but they still differ.
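To confirm that the two entries really are identical in the input, the SMILES strings can be compared directly. This is a rough sketch that only does a plain text comparison (no canonicalisation, so two different spellings of the same molecule would not be detected); group_by_smiles is a hypothetical helper:

```python
from collections import defaultdict


def group_by_smiles(ism_text):
    """Map each SMILES string to the list of IDs that carry it."""
    groups = defaultdict(list)
    for line in ism_text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            smiles, mol_id = parts
            groups[smiles].append(mol_id.strip())
    return dict(groups)


# First three lines of best_udp1.ism, copied from above.
ism_text = """\
C1C(=O)NC(=O)N(C1=O)C2C(C(C(O2)COP(=O)(O)O)O)O DB03668
C(CC(C(=O)O)N(CO)O)CN1C(=O)C2(C(C(C(O2)COP(=O)(O)O)O)O)NC1=O DB04460
C(CC(C(=O)O)N(CO)O)CN1C(=O)C2(C(C(C(O2)COP(=O)(O)O)O)O)NC1=O DB02666
"""

duplicates = {s: ids for s, ids in group_by_smiles(ism_text).items() if len(ids) > 1}
for smiles, ids in duplicates.items():
    print(ids, smiles)

# Gap between the two reported scores (rows 2 and 3 of results.txt).
gap = abs(0.2447 - 0.2430)
print(f"score gap: {gap:.4f}")
```

On these lines the only duplicated string is the one shared by DB04460 and DB02666, and the gap between their two scores is 0.0017.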
Am I doing something wrong? Thank you again for your answers!
Florent.
P.S.: I attach the fingerprint files, which have a .txt extension here (for attachment requirements) but are actually named udp1.pf and refset1.pf.