Tautomerize is 80x slower for some compounds vs. others

User 81a38f9467

02-08-2011 00:52:57

I am using the standalone Standardizer 5.5.0.1.


We have a service that calls the standardizer and noticed that it was much slower for some sets of compounds versus others.  After investigating, we found that the slowest step in our standardization is the Tautomerize process.  As the attached screen shots depict, the set of 27 slow compounds takes two seconds per compound just for the tautomerize step, while a separate set of 1170 compounds takes a combined total of 29 seconds, or 80 times faster per compound.  (All times were reproducible.)  Even the subset of the 27 highest MW compounds from the set of 1170 took only 3 seconds total.
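
As a quick check of the per-compound arithmetic quoted above, here is a minimal Python sketch. It uses only the timings stated in this post; the variable and function names are illustrative and not part of Standardizer.

```python
# Timings reported above (seconds, Tautomerize step only).
slow_total, slow_n = 2.0 * 27, 27   # 27 slow compounds at about 2 s each
full_total, full_n = 29.0, 1170     # the separate set of 1170 compounds
fast_total, fast_n = 3.0, 27        # its 27 highest-MW compounds


def per_compound(total_seconds, n_compounds):
    """Average tautomerize time per compound, in seconds."""
    return total_seconds / n_compounds


# How many times slower the slow set is, per compound.
ratio_vs_full = per_compound(slow_total, slow_n) / per_compound(full_total, full_n)
ratio_vs_fast27 = per_compound(slow_total, slow_n) / per_compound(fast_total, fast_n)

print(f"slow 27 vs. full set of 1170: about {ratio_vs_full:.0f}x slower")
print(f"slow 27 vs. fast 27:          about {ratio_vs_fast27:.0f}x slower")
```

The first ratio is the roughly 80-fold figure in the title; the second compares the two attached 27-compound files directly.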


We have additional steps in the standardization and some much larger data sets to deal with, so optimizing the process for speed is important to us.


Can you explain why the 27 slow compounds pass through tautomerization so much slower than others?  Understanding is nice, but what we really want is a faster process.


I'm attaching files with the slow 27 and the fast 27 compounds.

User 851ac690a0

06-08-2011 14:22:34

Hi, 


Thank you for the feedback. I repeated your test.


The calculation time ratio of the "slow vs. fast 27" compounds is approx. 18. This is in agreement with your findings.


Slow_27 /  Fast_27  = 54/3   ~  18     (public version of standardizer) 


With the current development version, available only on my PC, the calculation time decreased.


The above  ratio is:   Slow_27 /  Fast_27   ~  2    (not public version of standardizer)


This version is not public yet but  it will be available soon.


Knowing how to do this requires a clear understanding of the tautomerization process, since that is the only reliable way to improve the speed of the calculation. A couple of versions earlier the calculation time was very large for this type of compound, sometimes more than 10 minutes. Now we are talking about seconds.


 


Jozsi

User 8d34d3a066

07-02-2012 17:13:38

Hi,


Are you able to confirm if the predicted speed increases are available now? We are experiencing slowness with tautomerization and I would like to know if we have the same problem as outlined above or if I should post some more information (examples of slow molecules etc.).


thanks


Richard

User 851ac690a0

13-02-2012 16:35:55

Hi,


 


The standard canonical tautomer calculation has been accelerated significantly. This fast version, 5.9, will be available this month.


 


Jozsi

User 677b9c22ff

08-03-2012 01:25:19

 


Hi,


Just to comment on this: the values reported were relative, so the slowdown was certainly real,


but now I used Standardizer64 version 5.9 on a quad-core, 8-thread Core i7-2760QM at a modest 2.4 GHz. Both files run in either 0 seconds or 1 second at most.


Process completed.

Molecules standardized:     27
Overall progress:     100%
Time elapsed:         0h 0m 0s (fast sdf version)
------------------------------------
Process completed.

Molecules standardized:     27
Overall progress:     100%
Time elapsed:         0h 0m 1s (slow sdf version)



Issue solved, I guess. Anyway, I love the threading and the way it crunches through the molecules. Congratulations, nice implementation: fast, fast, fast.


I additionally used the very diverse NCI test set with 250,251 2D structures in SDF format.

WARNING: This is a 90 MB file that uncompresses to about 982 MB!

It just runs through it at 82 mols/sec (48 min total for 250241 molecules), although some of the molecules in it bring all 8 threads to a halt.


Source: http://cactus.nci.nih.gov/download/nci/


Actually there were a bunch of errors too:


#9 Error at molecule No. 110111 Array index out of range: 20

#10 Error at molecule No. 120522 Array index out of range: 11


    at chemaxon.alchemist.standardizer.StandardizerAlchemistTask.calculate(StandardizerAlchemistTask.java:209)
    at chemaxon.alchemist.AlchemistTask$ActualTask.<init>(AlchemistTask.java:178)
    at chemaxon.alchemist.AlchemistTask$1.construct(AlchemistTask.java:69)
    at chemaxon.alchemist.utils.SwingWorker$2.run(SwingWorker.java:110)
    at java.lang.Thread.run(Unknown Source)
   



Cheers


Tobias

User 851ac690a0

08-03-2012 08:43:59

Hi,


 


Thank you for testing speed.


I will fix the bug found with the NCI set.


Jozsi


 

User 7910dcb734

18-11-2013 19:44:49

Hi,


I'm curious if there are any benchmarks for the Standardizer's tautomerise function. I'm currently trying to incorporate a standardized version of the ChEMBL dataset (https://www.ebi.ac.uk/chembl/), and the 1.2M compounds in Release15 (ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_15/) are at an estimated 20 hours to standardize, using the standalone standardizer.


I've narrowed the timesink down to the tautomerise function.


Should I expect this sort of timescale? (I'm using version 6.1.2 of ChemAxon software.)


 


Cheers,


 


Brendan


 


 

User 677b9c22ff

19-11-2013 03:26:51

Hi,


be aware that there are newer releases of ChEMBL (ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/).


The ChEMBL DB has 1.2 million compounds, and 3000 of them are larger than 2000 Dalton MW. So unless you are a protein or peptide researcher, I would exclude them programmatically. The 3000 compounds are attached as an extremely-large-compounds.cxsmarts file in the ZIP below. Try those 3000 and you will see that they take almost 90%, or about 18 hours, of your total computation time to tautomerize.
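
One way to exclude very large structures before tautomerization is sketched below. This is only an illustration under stated assumptions, not the method from the attached PDF: it reads the atom count from the V2000 counts line of each SDF record (fourth line, columns 1-3) and uses it as a rough stand-in for the 2000 Da limit rather than computing a molecular weight. The cutoff `MAX_ATOMS`, the function names, and the file name in the usage note are all hypothetical. Records that cannot be read this way (for example V3000 records) are put with the large ones so they can be handled separately.

```python
from typing import Iterable, Iterator, List, Tuple

MAX_ATOMS = 150  # hypothetical cutoff, a rough proxy for ~2000 Da; tune on your data


def iter_sdf_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each SDF record as a list of lines, ending with its '$$$$' line."""
    record: List[str] = []
    for line in lines:
        record.append(line)
        if line.strip() == "$$$$":
            yield record
            record = []
    if any(l.strip() for l in record):
        yield record  # trailing record without a terminator


def atom_count(record: List[str]) -> int:
    """Atom count from a V2000 counts line; -1 if it cannot be read."""
    if len(record) < 4 or "V3000" in record[3]:
        return -1
    try:
        return int(record[3][:3])
    except ValueError:
        return -1


def split_by_size(lines: Iterable[str],
                  max_atoms: int = MAX_ATOMS) -> Tuple[list, list]:
    """Split records into (small, large); unreadable records count as large."""
    small, large = [], []
    for rec in iter_sdf_records(lines):
        n = atom_count(rec)
        (small if 0 <= n <= max_atoms else large).append(rec)
    return small, large
```

Typical use would be to open the SDF file (say, a hypothetical `input.sdf`), pass the file object to `split_by_size`, and write the two lists of records to separate files, so that only the small ones go through the Tautomerize step.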


With the new 6.0 version your tautomerization speed is dependent on:


1) Processor speed (best >3 GHz, latest 2013 chip technology)


2) Processor or CPU count (best 16- 40 CPUs) or at least 24-32 threads.


3) DISK speed (Best via RAMDISK or SSD RAID array)


4) Memory (unless you use the compiled WIN.EXE): to avoid costly Java garbage collection, assign 40 to 80 GByte (not MByte) or more heap space.


 


One million structures from ChEMBL16 took around 40 minutes on my system (all <2000 Da).


To put that into perspective, the first 500,000 compounds took around 7 minutes!


That's about 1200 compounds per second for the first 500k; throughput of course drops with larger MW.


(Dual CPU Xeon E5-2687W with 196 Gbyte RAM, 40 Gbyte assigned as RAMDISK drive, 40G JAVA heap size).
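
The throughput figures above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the rate for the second 500k compounds is inferred from the two quoted timings rather than measured separately, and the 20-hour line uses the estimate from the earlier question in this thread.

```python
def throughput(n_compounds, minutes):
    """Compounds processed per second."""
    return n_compounds / (minutes * 60.0)


first_half = throughput(500_000, 7)         # first 500k compounds in ~7 min
overall = throughput(1_000_000, 40)         # one million compounds in ~40 min
second_half = throughput(500_000, 40 - 7)   # inferred: remaining 500k in ~33 min
reported = throughput(1_200_000, 20 * 60)   # 1.2M compounds in an estimated 20 h

print(f"first 500k:   {first_half:.0f} compounds/s")
print(f"overall:      {overall:.0f} compounds/s")
print(f"second 500k:  {second_half:.0f} compounds/s (inferred)")
print(f"20 h for 1.2M: {reported:.0f} compounds/s")
```

The drop between the first and second half illustrates how strongly the rate depends on molecule size.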


 


I also added a PDF that explains how to sort according to size or use other tools as filter to exclude those large compounds.


 


Cheers


Tobias

User 7910dcb734

19-11-2013 10:35:02

Hi Tobias,


Many thanks, that's an extremely useful and comprehensive post. I'll get back to you when I've investigated some more after implementing some of the changes this implies.


Best wishes,


Brendan