ChemAxon Technical Support Forum (archived)

Tautomerize is 80x slower for some compounds vs. others
Tim

Joined: 19 Feb 2010
Posts: 59

Posted: Tue Aug 02, 2011 1:52 am

I am using the standalone Standardizer 5.5.0.1.

We have a service that calls the standardizer and noticed that it was much slower for some sets of compounds versus others.  After investigating, we found that the slowest step in our standardization is the Tautomerize process.  As the attached screen shots depict, the set of 27 slow compounds takes two seconds per compound just for the tautomerize step, while a separate set of 1170 compounds takes a combined total of 29 seconds, or 80 times faster per compound.  (All times were reproducible.)  Even the subset of the 27 highest MW compounds from the set of 1170 took only 3 seconds total.
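As a quick check of the figures above, the per-compound times can be worked out in a few lines of Python. This is only a sketch using the numbers quoted in this post; the variable names are made up for illustration.

```python
# Timings quoted in the post (tautomerize step only).
slow_count = 27
slow_seconds_per_compound = 2.0   # "two seconds per compound"

fast_count = 1170
fast_total_seconds = 29.0         # "a combined total of 29 seconds"

# Average time per compound in the fast set, and the resulting ratio.
fast_seconds_per_compound = fast_total_seconds / fast_count
slowdown = slow_seconds_per_compound / fast_seconds_per_compound

print(f"fast set: {fast_seconds_per_compound * 1000:.1f} ms per compound")
print(f"slow set: {slow_seconds_per_compound * 1000:.0f} ms per compound")
print(f"slowdown: about {slowdown:.0f}x")
```

The ratio comes out at roughly 80, which matches the figure in the topic title.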

We have additional steps in the standardization and some much larger data sets to deal with, so optimizing the process for speed is important to us.

Can you explain why the 27 slow compounds pass through tautomerization so much slower than others?  Understanding is nice, but what we really want is a faster process.

I'm attaching files with the slow 27 and the fast 27 compounds.




Attachment: ScreenShot640.gif (57.91 KB)
Attachment: Fast_compounds_final_27.sdf (140.09 KB)
Attachment: Slow_compounds_final_27.sdf (83.66 KB)

Jozsef
ChemAxon personnel
Joined: 25 May 2004
Posts: 568

Posted: Sat Aug 06, 2011 3:22 pm

Hi, 

Thank you for the feedback. I repeated your test.

The calculation time ratio of the "slow 27" to the "fast 27" compounds is approx. 18. This is in agreement with your findings.

Slow_27 / Fast_27 = 54/3 ~ 18     (public version of Standardizer)

With the current trial version, available only on my PC, the calculation time decreased.

With this version the ratio is: Slow_27 / Fast_27 ~ 2     (non-public version of Standardizer)

This version is not public yet, but it will be available soon.

Knowing how to do this requires a clear understanding of the tautomerization process, since that is the only reliable way to improve the speed of the calculation. A couple of versions ago the calculation time for this type of compound was very long, sometimes more than 10 minutes. Now we are talking about seconds.

 

Jozsi

Richard

Joined: 07 May 2010
Posts: 14

Posted: Tue Feb 07, 2012 6:13 pm

Hi,

Are you able to confirm if the predicted speed increases are available now? We are experiencing slowness with tautomerization and I would like to know if we have the same problem as outlined above or if I should post some more information (examples of slow molecules etc.).

thanks

Richard

Jozsef
ChemAxon personnel
Posted: Mon Feb 13, 2012 5:35 pm

Hi,

The standard canonical tautomer calculation has been accelerated significantly. This faster version, 5.9, will be available this month.

Jozsi

Tobias

Joined: 26 Jan 2005
Posts: 580

Posted: Thu Mar 08, 2012 2:25 am

 

Hi,

Just to comment on this: the values reported above were relative, so they were certainly true at the time.

I have now run Standardizer64 version 5.9 on a quad-core, 8-thread Core i7-2760QM at a modest 2.4 GHz. Both files finish in 0 or 1 second at most.

Process completed.

Molecules standardized:     27
Overall progress:     100%
Time elapsed:         0h 0m 0s (fast sdf version)
------------------------------------
Process completed.

Molecules standardized:     27
Overall progress:     100%
Time elapsed:         0h 0m 1s (slow sdf version)

Issue solved, I guess. Anyway, I love the threading, and the attached screenshot shows how it crunches through the molecules. Congratulations, nice implementation: fast, fast, fast.

I additionally used the very diverse NCI test set with 250,251 2D structures in SDF format. WARNING: This is a 90 MB file that uncompresses to about 982 MB! It runs through the set at 82 mols/sec (48 min total for 250241 molecules), although some molecules in the set bring all 8 threads to a halt.

Source: http://cactus.nci.nih.gov/download/nci/

Actually there were a bunch of errors too:

#9 Error at molecule No. 110111 Array index out of range: 20

#10 Error at molecule No. 120522 Array index out of range: 11


    at chemaxon.alchemist.standardizer.StandardizerAlchemistTask.calculate(StandardizerAlchemistTask.java:209)
    at chemaxon.alchemist.AlchemistTask$ActualTask.<init>(AlchemistTask.java:178)
    at chemaxon.alchemist.AlchemistTask$1.construct(AlchemistTask.java:69)
    at chemaxon.alchemist.utils.SwingWorker$2.run(SwingWorker.java:110)
    at java.lang.Thread.run(Unknown Source)
   

Cheers

Tobias




Attachment: Capture-thread.PNG (6.03 KB)

Jozsef
ChemAxon personnel
Posted: Thu Mar 08, 2012 9:43 am

Hi,

 

Thank you for testing the speed.

I will fix the bug found with the NCI set.

Jozsi

 

Brendan

Joined: 12 Jul 2012
Posts: 44

Posted: Mon Nov 18, 2013 8:44 pm

Hi,

I'm curious if there are any benchmarks for the Standardizer's tautomerise function. I'm currently trying to incorporate a standardized version of the ChEMBL dataset (https://www.ebi.ac.uk/chembl/), and the 1.2M compounds in Release15 (ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_15/) are at an estimated 20 hours to standardize, using the standalone standardizer.

I've narrowed the timesink down to the tautomerise function.

Should I expect these sort of timescales? (I'm using version 6.1.2 of ChemAxon software.)

 

Cheers,

Brendan

Tobias

Posted: Tue Nov 19, 2013 4:26 am

Hi,

be aware that there are newer releases of ChEMBL (ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/).

The ChEMBL DB has 1.2 million compounds, and 3000 of them are larger than 2000 Dalton MW. So unless you are a protein or peptide researcher, I would exclude them programmatically. The 3000 compounds are attached below as a zipped extremely-large-compounds.cxsmarts file. Try those 3000 on their own: they account for almost 90%, or 18 hours, of your tautomerization run time.

With the new 6.0 version your tautomerization speed depends on:

1) Processor speed (best >3 GHz, latest 2013 chip technology)

2) Processor (CPU) count (best 16-40 CPUs, or at least 24-32 threads)

3) Disk speed (best via RAMDISK or SSD RAID array)

4) Memory (unless you use the compiled WIN.EXE): to avoid costly Java garbage collection, assign 40 to 80 GByte (not MByte) or more of heap space.

 

One million structures from ChEMBL 16 took around 40 minutes on my system (all <2000 Da).

To put that into perspective, the first 500,000 compounds took around 7 minutes!

That's about 1200 compounds per second for the first 500k; throughput drops, of course, with larger MW.

(Dual-CPU Xeon E5-2687W with 196 GByte RAM, 40 GByte assigned as a RAMDISK drive, 40 GByte Java heap size.)
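
The throughput figures above can be reproduced with a short calculation. This is a rough sketch in Python based only on the approximate times quoted in this post ("around 7 minutes", "around 40 minutes"), so the results are estimates; the function name is made up for illustration.

```python
def throughput(compounds: int, minutes: float) -> float:
    """Return the average number of compounds processed per second."""
    return compounds / (minutes * 60.0)


first_half = throughput(500_000, 7)         # first 500k compounds
overall = throughput(1_000_000, 40)         # the full one million
second_half = throughput(500_000, 40 - 7)   # implied rate for the remaining 500k

print(f"first 500k:     {first_half:.0f} compounds/sec")
print(f"overall:        {overall:.0f} compounds/sec")
print(f"remaining 500k: {second_half:.0f} compounds/sec")
```

The implied rate for the second half of the set is several times lower than for the first half, which fits the remark that throughput drops with larger MW.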

 

I also attached a PDF that explains how to sort by size, or how to use other tools as a filter, to exclude those large compounds.
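
For readers without the PDF, one way to split off very large structures before standardization is sketched below in plain Python. This is not the method from the PDF and does not use any ChemAxon API. It assumes V2000 SDF input and uses the atom count from each record's counts line as a crude stand-in for molecular weight. The function names and the `max_atoms` threshold are hypothetical and would need tuning against real data.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_sdf_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each SDF record as a list of lines, including its '$$$$' terminator."""
    record: List[str] = []
    for line in lines:
        record.append(line)
        if line.strip() == "$$$$":
            yield record
            record = []
    if any(l.strip() for l in record):
        yield record  # trailing record without a terminator


def atom_count(record: List[str]) -> int:
    """Read the atom count from a V2000 counts line (4th line, columns 1-3).

    Assumption: V2000 molfiles only; V3000 records store counts elsewhere.
    """
    if len(record) < 4:
        raise ValueError("record too short to contain a counts line")
    return int(record[3][:3])


def split_by_size(
    lines: Iterable[str], max_atoms: int
) -> Tuple[List[List[str]], List[List[str]]]:
    """Split records into (small, large) by atom count; max_atoms is hypothetical."""
    small: List[List[str]] = []
    large: List[List[str]] = []
    for record in iter_sdf_records(lines):
        (small if atom_count(record) <= max_atoms else large).append(record)
    return small, large


def filter_sdf_file(src_path: str, small_path: str, large_path: str,
                    max_atoms: int) -> None:
    """Write small and large records to separate files (holds records in memory)."""
    with open(src_path) as src:
        small, large = split_by_size(src, max_atoms)
    with open(small_path, "w") as out:
        out.writelines(line for rec in small for line in rec)
    with open(large_path, "w") as out:
        out.writelines(line for rec in large for line in rec)
```

The small file can then be standardized first, and the large one run separately or skipped. Note that the counts line includes any explicit hydrogens, so the same threshold can behave differently between files.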

 

Cheers

Tobias




Attachment: compd-large.jpg (38.39 KB)
Attachment: CPU.jpg (92.77 KB)
Attachment: tautomerize.pdf (845.23 KB)
Attachment: extremely-large-compounds.zip (261.87 KB). These are 3000 compounds above 2000 Da that take several hours to tautomerize.

Brendan

Posted: Tue Nov 19, 2013 11:35 am

Hi Tobias,

Many thanks, that's an extremely useful and comprehensive post. I'll get back to you when I've investigated some more after implementing some of the changes this implies.

Best wishes,

Brendan
