Is there a reasonable way to process all of the structures in an sdf to determine the major microspecies at a given pH and then export the results to a new sdf?
I know how to determine pKa and major microspecies for an individual structure, but I am interested in doing this for a data set that would be impractical to process one structure at a time. I understand that this kind of thing can be done with cxcalc, and I see a cxcalc binary and .bat file in MarvinBeans\bin. If this is the right tool, can someone post a link to a tutorial or page with examples?
Yes, you are right. The only way to batch process a data file with many molecules is to use the cxcalc command line tool. In your case you need the following command:
cxcalc majormicrospecies -H 7.0 test.mol -f sdf ms.sdf
This command calculates the major microspecies for the test data file and exports it to an SD file. In this case, the pH is 7.0.
You can get a list of all cxcalc calculations by typing
I hope this helps.
Thank you for the information.
A few practical questions.
Would I run cxcalc from the Windows command line or from cygwin bash, or does it matter?
Would cxcalc be in my path or would I need to have the shell pwd, the cxcalc binary, and the input and output files all in the same directory?
I see there is a cxcalc.bat file in the same directory as the cxcalc binary. What is that used for?
Are the sdf attribute fields from the input file propagated into the sdf output file? If not, what index is used to keep the data in registration?
I was able to get this working using the following command in cygwin bash,
./cxcalc majormicrospecies -H 2.5 -M -f sdf input.sdf > output_results.sdf
with pwd being the marvin install /bin directory.
I'm sure if I put the bin directory in my path I could run this from any location but I didn't try that.
If this is run without the redirect,
./cxcalc majormicrospecies -H 2.5 -M -f sdf input.sdf output_results.sdf
you get a "file not found" error for output_results.sdf. There doesn't seem to be any explicit way to declare input and output file name arguments. If you run without the redirect and the output_results.sdf argument, the output is to the terminal.
Overall, this works well and processed ~2500 compounds in a few seconds. All data in the input sdf is propagated to the output sdf, including the molfile first line, which is important.
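On the earlier point about putting the bin directory in the path, which I did not try: a minimal sketch for a bash session might look like the following. MARVIN_BIN and its default value are placeholders of mine, not the real install location, so set it to wherever the Marvin bin directory actually is.

```shell
# MARVIN_BIN is a hypothetical placeholder; point it at the real Marvin bin directory.
MARVIN_BIN="${MARVIN_BIN:-/path/to/MarvinBeans/bin}"

# Prepend the directory to PATH only if it is not already there.
case ":$PATH:" in
    *":$MARVIN_BIN:"*) ;;
    *) PATH="$MARVIN_BIN:$PATH"; export PATH ;;
esac
```

With that in place, the same command should be callable as cxcalc rather than ./cxcalc from any working directory, though I have not verified this.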
There appears to be an untrapped exception when the input file has a counterion. The molfile for the major microspecies is output to the sdf, but nothing else. None of the sdf attribute fields from the input sdf are included with the molfile, and the > <_MOLCOUNT> field is also missing. If counterions are removed, this issue goes away. In most cases, it is best practice to leave input files untouched, since they may represent the exact structure (including formulation) of the compound that went into the test tube. Having to remove counterions is therefore problematic, because it introduces the potential for permanent information loss.
I'm not sure what causes this issue since Marvin is actually processing the compound and creating the correct molfile output. There is no warning issued to stderr when this occurs, so I believe this to be untrapped.
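Since nothing is written to stderr, one way to find the affected records is to scan the output sdf for records that carry no data fields at all. The function below is a rough sketch of that check, not anything that ships with Marvin. The function name is my own, and it assumes data headers start with ">" and contain a field name in angle brackets, and that records end with a "$$$$" line.

```shell
# Print the 1-based index of every SDF record that has no "> <FIELD>" data header.
# Usage: sdf_records_without_fields output_results.sdf
sdf_records_without_fields() {
    awk '
        BEGIN { rec = 1; has_field = 0; has_lines = 0 }
        # A "$$$$" line closes the current record.
        /^\$\$\$\$/ {
            if (!has_field) print rec
            rec++; has_field = 0; has_lines = 0
            next
        }
        { has_lines = 1 }
        # Data header lines look like "> <NAME>" or ">  <NAME>".
        /^>.*<[^>]+>/ { has_field = 1 }
        # Handle a final record that lacks a closing "$$$$" line.
        END { if (has_lines && !has_field) print rec }
    ' "$1"
}
```

Running it on the output file lists the record positions that came through as a bare molfile, which can then be matched back to the input by position.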
In my last response I didn't give you the full details of the cxcalc input/output options. Sorry for that. To correct myself, cxcalc uses the following syntax for its options:
cxcalc general_options input plugin_options ...
Specifying the output file is a general option and can be done like this:
cxcalc -o output.sdf input.mol majormicrospecies -H 7.0 -f sdf
This will create the output.sdf file and save the output to it in SD format. Regarding the calculation for compounds with counterions: it is a bug! The input information is lost in the output. Thanks for reporting it. We will fix it and notify you when the fix is ready.
You can find information on all options by typing cxcalc -h.
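For batch runs it may also be worth confirming that the output contains as many records as the input. The following is only a sketch under a few assumptions: the function names are made up for illustration, the input is a multi-record SD file (a single .mol file has no "$$$$" delimiters), and cxcalc is on the path. The cxcalc call itself uses the same option order as the command above.

```shell
# Count records in an SD file by counting "$$$$" delimiter lines.
count_sdf_records() {
    grep -c '^\$\$\$\$' "$1"
}

# Run the major microspecies calculation, then compare record counts.
# Arguments: input SD file, output SD file, pH value.
run_majorms() {
    in=$1; out=$2; ph=$3
    cxcalc -o "$out" "$in" majormicrospecies -H "$ph" -f sdf || return 1
    n_in=$(count_sdf_records "$in")
    n_out=$(count_sdf_records "$out")
    if [ "$n_in" -ne "$n_out" ]; then
        echo "record count mismatch: $n_in in, $n_out out" >&2
        return 1
    fi
}
```

A call would look like run_majorms input.sdf output.sdf 7.0. Note that a matching count does not show that the data fields were carried over, only that no records were dropped.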
We have made the necessary bugfixes for your issue. Please download the newest version of cxcalc and test it.