Problems with jc_equals and jc_compare

User 8139ea8dbd

30-04-2007 22:01:00

We have two smiles:





Smiles A: 'OC/C=C/C=C/1\OC(=O)C(=C1)\C=C\CO'


Smiles B: 'OC\C=C\C=C1/OC(=O)C(\C=C\CO)=C1'





Question 1. Are they the same?





select jc_compare( 'OC/C=C/C=C/1\OC(=O)C(=C1)\C=C\CO','OC\C=C\C=C1/OC(=O)C(\C=C\CO)=C1','t:e queryType:l') from dual;


returns 1





select jc_equals('OC/C=C/C=C/1\OC(=O)C(=C1)\C=C\CO','OC\C=C\C=C1/OC(=O)C(\C=C\CO)=C1') from dual;


returns 0





Is this a bug?





Question 2. Is this a bug related to jcf_molconvert?


select jc_compare('OC/C=C/C=C/1\OC(=O)C(=C1)\C=C\CO', jcf_molconvert('OC/C=C/C=C/1\OC(=O)C(=C1)\C=C\CO', 'smiles:ua_day'), 't:e queryType:l') from dual;


returns 1, okay





select jc_equals('OC/C=C/C=C/1\OC(=O)C(=C1)\C=C\CO', jcf_molconvert('OC/C=C/C=C/1\OC(=O)C(=C1)\C=C\CO', 'smiles:ua_day')) from dual;


returns 0, smiles A no longer equals itself after molconvert?





Question 3. Is there a related bug in cartridge search?





Our database CPD table contains Smiles B: OC\C=C\C=C1/OC(=O)C(\C=C\CO)=C1





select jc_smiles from cpd where jc_compare(jc_smiles,'OC/C=C/C=C/1\OC(=O)C(=C1)\C=C\CO','t:e queryType:l') =1;


or


select jc_smiles from cpd where jc_equals(jc_smiles,'OC/C=C/C=C/1\OC(=O)C(=C1)\C=C\CO') =1;


Both return no hit: searching with smiles A using jc_compare/jc_equals does not find smiles B.





select jc_smiles from cpd where jc_compare(jc_smiles,jcf_molconvert('OC/C=C/C=C/1\OC(=O)C(=C1)\C=C\CO','smiles:u'),'t:e queryType:l') =1;


returns OC\C=C\C=C1/OC(=O)C(\C=C\CO)=C1


Search using canonicalized smiles A finds smiles B.





select jc_smiles from cpd where jc_compare(jc_smiles,'OC\C=C\C=C1/OC(=O)C(\C=C\CO)=C1','t:e queryType:l') =1;


returns OC\C=C\C=C1/OC(=O)C(\C=C\CO)=C1


okay, search with smiles B finds smiles B in database





Question 4.


I go to the latest demo


http://chemaxon.com/demosite/marvin/index.html





When I paste in smiles A, I get a bad-looking structure (see attachment).





Related info: we are not using the latest cartridge, so could you please try the above examples on your latest version?





Thanks.

ChemAxon aa7c50abf8

01-05-2007 14:52:14

Regarding question No. 1 and No. 2:





jc_equals does a perfect search: http://www.chemaxon.com/jchem/doc/guide/cartridge/cartapi.html#jc_equals .





The jc_compare search type option equivalent to jc_equals is 'p' ('e' means exact search):


Code:
select jc_compare( 'OC/C=C/C=C/1\OC(=O)C(=C1)\C=C\CO','OC\C=C\C=C1/OC(=O)C(\C=C\CO)=C1','t:p') from dual;






And this returns zero.





We will answer the remaining questions soon.





Thanks


Peter

User 8139ea8dbd

01-05-2007 17:01:50

Why is perfect match so much slower than exact search?


We are not using the latest cartridge, so I wonder whether this is still true for the latest version. Thanks.





select cpd_sid from cpd where jc_compare(jc_smiles, 'OC\C=C\C=C1/OC(=O)C(\C=C\CO)=C1' ,'t:p queryType:l')=1;





Takes 1.8 seconds





select cpd_sid from cpd where jc_compare(jc_smiles, 'OC\C=C\C=C1/OC(=O)C(\C=C\CO)=C1' ,'t:e queryType:l')=1;





Takes 0.08 seconds

ChemAxon aa7c50abf8

01-05-2007 20:10:09

Quote:
We are not using the latest cartridge, so I wonder whether this is still true for the latest version.
I tried it with both your query and the more selective O=Cc1ccccc1 on the NCI dataset with JChem 3.2.5. Perfect search was 1.5 to 2 times faster than exact search with either query: 40 ms with perfect search versus 70-80 ms with exact search. (The speedup was smaller with the more selective query than with yours.)





Thanks


Peter

ChemAxon a3d59b832c

01-05-2007 21:01:43

We are still looking at the structures, but it is apparent that in the first smiles the double bond in the ring has a stereo configuration, and in the second smiles this double bond is unspecified. (Probably it is an "accidental" configuration, caused by the slashes representing the stereo configuration of the side-chains.)





This is why there is no match, and why Marvin's clean method got confused while trying to generate the trans configuration in the 5-membered ring.


Molconvert seemingly removed the confusing stereo information from the ring.





(We have a new clean algorithm in our development codebase, and that cleans the ring nicely with a warning of "TRANS ring bond in ring smaller than size 8 is removed.")

ChemAxon 9c0afc9aaf

03-05-2007 16:18:12

Hi,
Quote:
Why is perfect match so much slower than exact search?
In general these search types may perform the quick pre-filtering (screening) phase on different principles.





In perfect search mode the indexed cd_hash column is used to get a list of possible duplicates.





During exact search, screening is performed with fingerprints, which are stored in memory (the structure cache).


From JChem version 3.1.4 the hash code is also used for exact search whenever applicable (no query features present in the query structure).


Screening with the hash code is expected to be a more efficient filter (fewer candidates left for graph search).
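The difference between the two screening strategies can be illustrated with a toy sketch in Python. This is not JChem's implementation: the hash function, the fingerprint, and the sample rows are all made up for illustration, and plain strings stand in for structures. It only shows why an indexed hash lookup tends to leave fewer candidates for the graph search than a fingerprint scan.

```python
from collections import defaultdict
from typing import Dict, List

NBITS = 16  # width of the toy fingerprint (illustrative assumption)


def toy_hash(s: str) -> int:
    """Position-weighted character sum; a stand-in for a structure hash code."""
    return sum((i + 1) * ord(c) for i, c in enumerate(s)) % 1_000_003


def toy_fingerprint(s: str) -> int:
    """Set one bit per adjacent character pair; a stand-in for a chemical fingerprint."""
    fp = 0
    for i in range(len(s) - 1):
        fp |= 1 << ((ord(s[i]) + ord(s[i + 1])) % NBITS)
    return fp


def build_hash_index(rows: List[str]) -> Dict[int, List[int]]:
    """Map each hash value to the row ids that carry it (plays the role of the cd_hash index)."""
    index: Dict[int, List[int]] = defaultdict(list)
    for row_id, s in enumerate(rows):
        index[toy_hash(s)].append(row_id)
    return index


def screen_by_hash(index: Dict[int, List[int]], query: str) -> List[int]:
    """Candidates are only the rows whose hash equals the query hash."""
    return list(index.get(toy_hash(query), []))


def screen_by_fingerprint(rows: List[str], query: str) -> List[int]:
    """Candidates are all rows whose fingerprint contains every bit of the query fingerprint."""
    q = toy_fingerprint(query)
    return [i for i, s in enumerate(rows) if (toy_fingerprint(s) & q) == q]


rows = ["CCO", "CCOC", "CCOCC", "OCC"]
index = build_hash_index(rows)
hash_candidates = screen_by_hash(index, "CCO")
fp_candidates = screen_by_fingerprint(rows, "CCO")
```

With these sample rows the hash lookup leaves a single candidate, while the fingerprint scan lets all four rows through to the (not shown) exact comparison step.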





Is your version older than 3.1.4?





Either of the following may also explain the slowdown:


- there is no index on the column cd_hash (it was removed by hand)


- the RDBMS (or the connection to it) is very slow for some reason





Best regards,





Szilard

User 8139ea8dbd

03-05-2007 16:42:31

Our cartridge version is 3.1.5


The cd_hash index is in place in the database.


Anyway, we will see if the performance is improved, when we install the latest version.





For this thread, we are most concerned about the discrepancy for smiles A before and after jcf_molconvert. If this is not considered a bug, does one have to molconvert every query smiles before an exact/perfect search?

ChemAxon a3d59b832c

04-05-2007 11:09:56

yzhou wrote:
For this thread, we are most concerned about the discrepancy for smiles A before and after jcf_molconvert. If this is not considered a bug, does one have to molconvert every query smiles before an exact/perfect search?
I am sorry about the delay in the answer.


There is a possibility that the problem is with the smiles import (stereochemistry next to the ring closure), but our colleague who develops this is currently unavailable, so he cannot confirm. We will be able to give a definite answer by Monday.

ChemAxon a3d59b832c

07-05-2007 15:13:26

Hi Yingyao,





It is not a smiles import bug. The problem simply is that the double bond stereo information is there in the ring in one smiles but not in the other.





If you use the cxsmiles format instead of smiles, it will retain the stereo information in small rings also (check the extended part "|t:13|" in the output below):
Quote:
$ molconvert cxsmiles -s "OC/C=C/C=C/1\OC(=O)C(=C1)\C=C\CO"


OC\C=C\C=C1/OC(=O)C(\C=C\CO)=C1 |t:13|
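As a small illustration, the extended part can be separated from the plain SMILES with a few lines of Python. This is only a sketch: the helper names are hypothetical, and it assumes the output has the shape shown above (SMILES, a space, then the extension between vertical bars). It checks only for the "t:" field seen in this thread and does not interpret the number.

```python
import re
from typing import Tuple


def split_cxsmiles(line: str) -> Tuple[str, str]:
    """Split 'SMILES |ext|' into (SMILES, ext); ext is '' when there is no extended part."""
    parts = line.strip().split(None, 1)
    if not parts:
        return "", ""
    smiles = parts[0]
    ext = ""
    if len(parts) == 2:
        match = re.fullmatch(r"\|(.*)\|", parts[1].strip())
        if match:
            ext = match.group(1)
    return smiles, ext


def has_t_field(ext: str) -> bool:
    """True if the extension contains a 't:<number>' field like the one in the output above."""
    return re.search(r"(?:^|,)t:\d", ext) is not None


smiles, ext = split_cxsmiles(r"OC\C=C\C=C1/OC(=O)C(\C=C\CO)=C1 |t:13|")
plain_smiles, plain_ext = split_cxsmiles(r"OC\C=C\C=C1/OC(=O)C(\C=C\CO)=C1")
```

A check like this could flag query strings that carry double bond stereo information in the extension before they are sent to a search.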


In the next major release (3.3) we will change this so that all double bonds in small rings (size <= 7) are always treated as CIS, regardless of how they are specified in the input. (This is the real-life scenario for all these rings due to the constraints of the ring geometry.)
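The rule itself is simple enough to state as a few lines of Python. This is a sketch of the rule as described here, not of the actual JChem code; the function name and the string labels are hypothetical, and finding the ring size of a bond is assumed to happen elsewhere.

```python
from typing import Optional

SMALL_RING_MAX = 7  # from the post: rings of size <= 7 count as small


def effective_config(ring_size: Optional[int], specified: Optional[str]) -> Optional[str]:
    """Return the configuration to use for a double bond.

    ring_size is the size of the smallest ring containing the bond, or None
    for a bond that is not in a ring. specified is "cis", "trans", or None
    (unspecified), as read from the input.
    """
    if ring_size is not None and ring_size <= SMALL_RING_MAX:
        return "cis"  # small-ring double bonds are CIS whatever the input says
    return specified  # chain bonds and large rings keep the input configuration
```

Under this rule, a ring double bond in a 5-membered ring that the input marks as trans (or leaves unspecified) would come out as cis.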





We are checking if there is an immediate possibility using Standardizer that can be helpful for you.





Szabolcs