jc_compare with t:ff tautomerSearch:y does not return

User 7f33ec9a5c

21-11-2012 03:57:33

Hi,

We need to match all forms of a small molecule, including tautomers, stereoisomers and salt forms. The options 't:ff tautomerSearch:y charge:i radical:i stereoSearchType:i' work well with jc_compare for many structures, but the search hangs on the particular SMILES 'Oc1ccc(CN=Nc2nnc(SCC(=O)Nc3nnc(SCC=C)s3)[nH]2)cc1' when it is passed as both arguments, as shown below:

--==========================================
-- Hangs and never returns
select jcf.compare('Oc1ccc(CN=Nc2nnc(SCC(=O)Nc3nnc(SCC=C)s3)[nH]2)cc1',
'Oc1ccc(CN=Nc2nnc(SCC(=O)Nc3nnc(SCC=C)s3)[nH]2)cc1',
't:ff tautomerSearch:y charge:i radical:i stereoSearchType:i') from dual;

--==========================================
-- Exhibits the same behavior: hangs and never returns
select *
from structure
where jc_compare(s_smiles, 'Oc1ccc(CN=Nc2nnc(SCC(=O)Nc3nnc(SCC=C)s3)[nH]2)cc1', 't:ff tautomerSearch:y charge:i radical:i stereoSearchType:i') = 1;

-- where s_smiles is indexed with jc_index
-- and 'Oc1ccc(CN=Nc2nnc(SCC(=O)Nc3nnc(SCC=C)s3)[nH]2)cc1' is contained
-- in the s_smiles column of structure.
--==========================================

On the structure table with jc_idx, we've noticed drastic differences in 't:ff tautomerSearch:y' performance. I suspect that other queries with similar structures are also being slowed down, as some 't:ff tautomerSearch:y' queries take over an hour, while others take less than a minute.

~Mike

ChemAxon 4a2fc68cd1

21-11-2012 09:05:49

Hi Mike,


We are aware of the performance issues with tautomer search. In fact, we are currently working on this, and upcoming JChem releases will provide significant improvements.


I checked the structures you sent. For me, this search does not hang, but it is indeed quite slow (on my PC, it takes about two minutes). Without 'tautomerSearch:y', it takes only one second.


Could you check if this search really hangs or is "just" rather slow?


Best regards,
Peter

ChemAxon 9c0afc9aaf

21-11-2012 17:04:59

Mike,


Your query has about 660 tautomers (combinatorial effect of multiple sites), which takes a couple of minutes just to enumerate - you can try this in MarvinSketch.


Then we have to search the database with each query individually - so it can take a long time to finish.


I have mentioned this potential drawback for structures like this before. As Peter noted, we are trying to make this more efficient in the future.


I am aware that you are using this full fragment search for duplicate search purposes:


https://www.chemaxon.com/forum/ftopic9790.html


We recommend the duplicate filtering table/index option (tdf:y) to avoid this performance problem during duplicate search.


The small price to pay is that if you do not want to consider tautomers during duplicate search, you have to explicitly specify "tautomerSearch:n" for duplicate (t:d) searches. As mentioned, this applies to duplicate search only; other searches are not affected.
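As a rough, untested illustration of the point above, a duplicate search that deliberately ignores tautomers on a tdf:y table or index might look like the sketch below. It reuses the `structure` table and `s_smiles` column from Mike's query and assumes the options are combined in one space-separated string, as in the other queries in this thread.

```sql
-- Sketch only: duplicate search (t:d) that ignores tautomers.
-- Assumes s_smiles carries an index created with the tdf:y option.
select *
from structure
where jc_compare(s_smiles,
                 'Oc1ccc(CN=Nc2nnc(SCC(=O)Nc3nnc(SCC=C)s3)[nH]2)cc1',
                 't:d tautomerSearch:n') = 1;
```

Leaving out "tautomerSearch:n" would, per the description above, make the same t:d search treat tautomers as duplicates.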


So these are the two choices now, or waiting to see how much the speed improves in the next versions.


Best regards,


Szilard

User 7f33ec9a5c

21-11-2012 19:32:22

pkovacs84 wrote:

Could you check if this search really hangs or is "just" rather slow?



Hi Peter,


YES! You are correct, my mistake, it takes 3 minutes on our production server:


SQL> set timi on
SQL> select jcf.compare('Oc1ccc(CN=Nc2nnc(SCC(=O)Nc3nnc(SCC=C)s3)[nH]2)cc1', 'Oc1ccc(CN=Nc2nnc(SCC(=O)Nc3nnc(SCC=C)s3)[nH]2)cc1', 't:ff tautomerSearch:y charge:i radical:i stereoSearchType:i') from dual;

JCF.COMPARE('OC1CCC(CN=NC2NNC(SCC(=O)NC3NNC(SCC=C)S3)[NH]2)CC1','OC1CCC(CN=NC2NN
--------------------------------------------------------------------------------
1

Elapsed: 00:03:02.04


~mike

User 7f33ec9a5c

21-11-2012 19:48:41

Wanted to clarify two points that Szilard makes below.

Szilard wrote:

Mike,


Your query has about 660 tautomers (combinatorial effect of multiple sites), which takes a couple of minutes just to enumerate - you can try this in MarvinSketch.

We recommend the duplicate filtering table/index option (tdf:y) to avoid this performance problem during duplicate search.

660? Wow! We were enumerating the tautomers using jc_evaluate_x and chemical terms as below, and we only got 3.  Are we missing a parameter in our query?


SQL> select jc_evaluate_x('Oc1ccc(CN=Nc2nnc(SCC(=O)Nc3nnc(SCC=C)s3)[nH]2)cc1','chemTerms:tautomers()') tautomers from dual;

TAUTOMERS
--------------------------------------------------------------------------------
OC1=CC=C(CN=NC2=NNC(SCC(=O)NC3=NN=C(SCC=C)S3)=N2)C=C1
OC1=CC=C(CN=NC2=NC(SCC(=O)NC3=NN=C(SCC=C)S3)=NN2)C=C1
OC1=CC=C(CN=NC2=NN=C(N2)SCC(=O)NC2=NN=C(SCC=C)S2)C=C1


What baffles me is that even if 660 tests are required, 3 minutes seems like a long time. Of course, if you need to do 660 x 660 = 435,600 tests, then 3 minutes starts to make more sense?


Also:


Our index of jc_idx_idxtyp was created with the tdf:y option you suggested, and the query performance described above is using the tautomer index.   This index has been performing really well for us, looking for tautomers with t:d is very very fast, and we really like it.   We are only experiencing slowness with t:ff searches.

ChemAxon 9c0afc9aaf

21-11-2012 20:11:02

Mike wrote:

660? Wow! We were enumerating the tautomers using jc_evaluate_x and chemical terms as below, and we only got 3.  Are we missing a parameter in our query?

Well, I have tried in MarvinSketch, and it turns out the settings for tautomer generation were quite loose.


Setting them back to the default gave 40, which is a bit closer. I am not sure how the cartridge function differs from the MarvinSketch default. I'll look into this.


Mike wrote:

What baffles me is that even if 660 tests are required, 3 minutes seems like a long time. Of course, if you need to do 660 x 660 = 435,600 tests, then 3 minutes starts to make more sense?

With the tautomer search option, think of substructure (or full fragment in this case) searching the whole database separately with each query tautomer (it's pretty close to what we do now, with some added optimizations).

The number of "tests" (graph searches) is not computed this way. It is the sum of the "screened" molecules for each query, which depends on the query, the size of the database and the molecules in it.

In general, non-duplicate searches are not as fast as duplicate searches.


Mike wrote:

Our index of jc_idx_idxtyp was created with the tdf:y option you suggested, and the query performance described above is using the tautomer index.   This index has been performing really well for us, looking for tautomers with t:d is very very fast, and we really like it.   We are only experiencing slowness with t:ff searches.

OK, I think I mixed up "t:ff" with a "t:f" approach for duplicates that was considered earlier with the "tdf:n" option. Now I know which solution you are currently using, sorry for the confusion.


Best regards,


Szilard 

User 7f33ec9a5c

21-11-2012 20:25:57

Szilard wrote:

Mike,


Your query has about 660 tautomers (combinatorial effect of multiple sites), which takes a couple of minutes just to enumerate - you can try this in MarvinSketch.



Hi Szilard, I'm only getting 4 tautomers.  


Can you check my settings in the images below and tell me what I am doing wrong? I'm sure it's user error, but I don't know what to click.

ChemAxon 9c0afc9aaf

22-11-2012 00:37:13

Have you tried to scroll down in the 2x2 matrix display? ;)


 

User 7f33ec9a5c

03-12-2012 17:19:23

Szilard wrote:

Have you tried to scroll down in the 2x2 matrix display? ;)

Well, that was silly of me. Yes! There are 604 tautomers there. User error: I just had to scroll down!


We still need to solve the problem of how to search for all tautomers and salt-forms of tautomers in a reasonable amount of time.  Again, using the t:d option returns nearly instantly, where t:ff takes an exceedingly long time for tautomers, even though we are using a tautomer index.  

ChemAxon 9c0afc9aaf

04-12-2012 04:56:02


Mike,


Mike wrote:

We still need to solve the problem of how to search for all tautomers and salt-forms of tautomers in a reasonable amount of time.  Again, using the t:d option returns nearly instantly, where t:ff takes an exceedingly long time for tautomers, even though we are using a tautomer index.


Again, the tautomer index option (tdf:y) is only effective for duplicate search (t:d).


The full fragment (t:ff) option with tautomers behaves similarly to a substructure search: it enumerates several queries.
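One way to see the difference described above on your own data is to time both search types in SQL*Plus. This is only a sketch: it reuses the `structure` table, the `s_smiles` column and the option strings quoted earlier in the thread, assumes the column is indexed with tdf:y, and makes no claim about the timings you will get.

```sql
-- Sketch only: compare elapsed times of the two search types.
set timing on

-- Duplicate search: tautomers are handled through the tdf:y index.
select count(*)
from structure
where jc_compare(s_smiles,
                 'Oc1ccc(CN=Nc2nnc(SCC(=O)Nc3nnc(SCC=C)s3)[nH]2)cc1',
                 't:d') = 1;

-- Full fragment search: the query tautomers are enumerated and searched.
select count(*)
from structure
where jc_compare(s_smiles,
                 'Oc1ccc(CN=Nc2nnc(SCC(=O)Nc3nnc(SCC=C)s3)[nH]2)cc1',
                 't:ff tautomerSearch:y charge:i radical:i stereoSearchType:i') = 1;
```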


As Peter mentioned above we are planning to make some improvements to our tautomer search in the future.


Until then some workarounds may be feasible, not sure if desirable ...


1. Use a search standardization (index parameter) that removes the salt (keepLargest).


This does not affect the structure in the DB, just the index data.


Then you can search with t:d instead of t:ff, but of course you cannot distinguish between the same structure occurring with different salts.


2. Maintain a duplicate column of your structure with the salt removed (e.g. by trigger).


You can use this for the above purpose (t:d for largest fragment instead of t:ff), otherwise the original column still has the salt form.


3. Store structures and salts separately. Registration systems usually do this, but I guess this would be the biggest rewrite in your code.
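As a hedged sketch of workaround 2, the query side might look like the following. The column name `s_smiles_parent` is hypothetical: it stands for a salt-stripped copy of `s_smiles` that is populated elsewhere (for example by a trigger) and carries its own index created with tdf:y. How the salt is stripped is not shown here.

```sql
-- Sketch only: duplicate search against a salt-free copy of the structure.
-- s_smiles_parent is a hypothetical column; the query structure is assumed
-- to be salt-free as well.
select *
from structure
where jc_compare(s_smiles_parent,
                 'Oc1ccc(CN=Nc2nnc(SCC(=O)Nc3nnc(SCC=C)s3)[nH]2)cc1',
                 't:d') = 1;
```

The matching rows still expose the original `s_smiles` value, so the salt form of each hit remains available.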


These are all the things I can think of - apart from waiting for the improvements of course. 


Best regards,


Szilard

ChemAxon 9c0afc9aaf

04-12-2012 05:13:06

PS:


Regarding the discrepancy between the tautomer count above in the cartridge and in Marvin:


The "tautomers" function only calculates dominant tautomers, and the structure has 3 of them.


http://www.chemaxon.com/marvin/help/chemicalterms/EvaluatorFunctions.html#dominanttautomersex


AFAIK we generate all tautomers during the search.

ChemAxon a3d59b832c

04-12-2012 08:15:03

Hi Mike,


In fact, we have been working on improving tautomer search in the last few months.

With JChem 5.12, we will soon release improvements in stereo handling and performance of single molecule tautomer matching with full structure, full fragment and duplicate search types.

Unfortunately, the database extension of these tautomer search speedups could not fit into our time box for 5.12. But it will be the first thing we develop for the JChem 6.0 release (due April, 2013).

(P.S. These speedups are the extension of the method that we already use for duplicate searching of tautomer duplicate filtering tables in the current version. With this method, tautomer duplicate searching is very fast in these tables.

Furthermore, we plan to move away from tdf tables. The new, fast methods in 6.0 will be available in all JChem tables and indexes.)


 


Best regards,


Szabolcs

User 7f33ec9a5c

10-12-2012 22:08:15

Szabolcs,


Thank you for the update. In most cases t:ff is working well for us; it is only in a few edge cases that the t:ff search slows down. I am glad to hear you plan optimizations for the 6.0 release.


I added another post today with a few other unusual issues we have been having with t:ff, which again are edge cases that rarely happen.


Thank you,


~mike