cxcalc Performance

User d27b99d458

06-01-2014 11:34:40

Hello,

I am calculating several molecular descriptors using cxcalc on a dataset of 4988 molecules ('.mol' files). However, these calculations have been running for two months and have not finished yet. I would expect the process to be much faster, since it is working on a very small dataset. Any idea what could be happening?


The descriptors that I am using are described in the attached file.

Thanks in advance,

Raquel

ChemAxon d51151248d

07-01-2014 10:00:12

Dear Raquel, 


As you have lots of descriptors, it is impossible to find out which of them takes so much time to compute without any output. To keep track of your calculations you can use the logging options of cxcalc: you can set the log file path, the level of logging, and the logging options. In your case, the timelimit=<limit in ms> option would be the best fit. For the full list of logging options, run cxcalc -h.


Best wishes, 


Daniel

User 677b9c22ff

15-01-2014 03:27:23

Hi,


It's always good to start with a small random benchmark set, and also to share some details about your hardware, software, and input molecules. If they are small molecules (<2000 Da) and still take this long, your machine or setup is simply too slow; if they are large proteins, even a fast computer may lag. So in your case, run each descriptor on, say, 50 compounds, then 100, then 500, and extrapolate the time you will need. It's also good to set time limits in case the program cannot find a fast solution for a molecule; it is better to filter such molecules out, for example by sorting your sets by molecular weight or complexity.
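
As a sketch, that extrapolation could look like this in plain Python (the timings below are hypothetical placeholders; the fit assumes roughly linear scaling of runtime with compound count, which holds for most per-molecule descriptors but not for combinatorial ones like stereoisomer enumeration):

```python
# Extrapolate total runtime from small benchmark runs, assuming the
# per-compound cost is roughly constant (runtime scales linearly with count).
# (compounds, seconds) pairs below are hypothetical -- measure your own.
benchmarks = [(50, 6.0), (100, 12.5), (500, 61.0)]

# Least-squares slope through the origin: seconds per compound.
per_compound = sum(n * t for n, t in benchmarks) / sum(n * n for n, _ in benchmarks)

total = 4988 * per_compound  # projected runtime for the full 4988-molecule set
print(f"~{per_compound:.3f} s/compound, ~{total / 60:.1f} min for 4988 molecules")
```

If the projected time from 50, 100, and 500 compounds already disagrees wildly between runs, the cost is not linear and a few individual molecules are likely dominating, which is exactly what a time limit catches.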


A small benchmark on 10,000 NCI compounds (2x your dataset size) needed 14 minutes (instead of 2 months) on a 16-core machine (3.3 GHz), with file output on a ramdisk, using cxcalc (v6.0). All processes were run in parallel, but half of the time was actually spent on the tetrahedralstereoisomers calculation. A further breakdown showed that the time needed was:


all other descriptors < majortautomer < stereoisomers < tetrahedralstereoisomers


Some cxcalc calculations may be parallelized, some may not (such as tetrahedralstereoisomers). Under Linux you can use make, GNU parallel, or PPSS; under Windows, nmake or the start command in a batch file:


    fast-calc.bat:



start /B cxcalc doublebondstereoisomers -f sdf NCI-10000.smi > doublebondstereoisomers.txt
start /B cxcalc stereoisomers -f sdf NCI-10000.smi >  stereoisomers.txt
start /B cxcalc tetrahedralstereoisomers -f sdf NCI-10000.smi >  tetrahedralstereoisomers.txt
start /B cxcalc logd NCI-10000.smi  > logd.txt
start /B cxcalc chargedistribution NCI-10000.smi  > chargedistribution.txt
start /B cxcalc msacc NCI-10000.smi > msacc.txt
start /B cxcalc msdon NCI-10000.smi > msdon.txt
start /B cxcalc generictautomer -f sdf NCI-10000.smi  > generictautomer.txt
start /B cxcalc majortautomer -f sdf NCI-10000.smi  > majortautomer.txt
start /B cxcalc canonicalresonant -f sdf NCI-10000.smi  > canonicalresonant.txt
start /B cxcalc enumerations -f sdf NCI-10000.smi  > enumerations.txt
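
Under Linux, the same fan-out pattern as the batch file can be sketched with Python's concurrent.futures; here echo stands in for cxcalc so the sketch runs anywhere (in a real run, swap in the cxcalc invocations from the batch file and write each job's stdout to its own file):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# One job per descriptor, mirroring the "start /B" lines above.
# "echo" is a stand-in; a real run would use ["cxcalc", desc, "NCI-10000.smi"]
# and redirect stdout to desc + ".txt".
descriptors = ["logd", "chargedistribution", "msacc", "msdon", "majortautomer"]

def run(desc):
    result = subprocess.run(["echo", "cxcalc", desc, "NCI-10000.smi"],
                            capture_output=True, text=True, check=True)
    return desc, result.stdout.strip()

# Launch the jobs concurrently, four at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    outputs = dict(pool.map(run, descriptors))

print(outputs["logd"])  # cxcalc logd NCI-10000.smi
```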


I added the NCI-10000.smi below. To further improve speed, one could also split the data sets for the stereoisomer calculations into smaller chunks, which would trim the time further. For example, chargedistribution needs only 15 seconds and logD only 20 seconds for 10,000 molecules here. The cxcalc parallelism (if there is any) also breaks down on some molecules, simply due to heap space, possibly I/O constraints, or because it is not fully threaded.
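
Splitting the input into fixed-size chunks, as suggested for the stereoisomer runs, is a one-liner; a sketch (the toy SMILES list stands in for the real NCI-10000.smi, and each chunk would be written to its own .smi file and fed to a separate cxcalc process):

```python
def split_chunks(lines, size):
    # Return consecutive slices of at most `size` lines each.
    return [lines[i:i + size] for i in range(0, len(lines), size)]

smiles = [f"C mol{i}" for i in range(23)]  # toy stand-in for NCI-10000.smi
chunks = split_chunks(smiles, 10)
print([len(c) for c in chunks])  # [10, 10, 3]
```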


Regarding some of your "descriptors": they actually produce molecular result files (such as the stereoisomers), not the stereoisomer count number. So here it may be better to make an estimate based on the formula 2^n (n = chiral centers + double bonds). That is a rough upper bound, but if you have a sugar with 15 chiral centers, you would create 32768 stereoisomeric sugars. So use the computationally cheap chiral center count for each of your molecules and then estimate the number of isomers.


For example, in the case of Hexa-N-acetylchitohexaose (FUHDMRPNDKDRFE-LPUYKFNUSA-N), a topology analysis will tell you that the chiral center count is 29 AND the double bond count is 6, which means 2^(29+6) = 34,359,738,368 stereoisomers. The question then becomes: could you actually handle an estimated 350 TByte of output (based on a 10k mol file)? If not, it is easier to just count or even estimate the stereoisomers instead of generating 34 billion of them (as an example). If you use the built-in cxcalc functions, make sure to override the default settings with the switch -m, --maxstereoisomers. For such extremes even the generation is too slow, so just estimating the numbers would be fine.
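
The 2^n upper bound is trivial to compute before running any enumeration; a sketch, with the stereocenter and double-bond counts taken from the examples above:

```python
def max_stereoisomers(chiral_centers, stereo_double_bonds):
    # Upper bound: each stereocenter and each stereogenic double bond
    # can independently take one of two configurations.
    return 2 ** (chiral_centers + stereo_double_bonds)

# Hexa-N-acetylchitohexaose: 29 chiral centers, 6 double bonds
print(max_stereoisomers(29, 6))  # 34359738368

# The 15-chiral-center sugar example
print(max_stereoisomers(15, 0))  # 32768
```

Running this over all 4988 molecules first would flag the combinatorial outliers that should be filtered out or only counted, not enumerated.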


With small benchmark sets of increasing compound numbers you would not wait 2 months; instead you would have a good estimate of how long things will take, and whether it is even feasible to go ahead.




Cheers


Tobias

User d27b99d458

16-01-2014 09:22:46

Hi,

Thank you for the answers, they are very helpful for understanding the problem. I will calculate only the counts of stereoisomers and chiral centers, and use the timelimit option to control the amount of time spent on each calculation.

Thank you so much,

Raquel