searching

User 5f37dcad67

02-05-2005 04:23:52

Dear all,


I am a new user of JChem. Please let me know how to search for molecules similar to a query molecule in a multi-molecule .sdf or .mol2 file. I am using JChem on Red Hat Linux version 11.


Regards,





Subbarao

ChemAxon a3d59b832c

02-05-2005 05:39:19

Hi Subbarao,





You need to use the jcsearch program for that:





http://www.jchem.com/doc/user/Jcsearch.html





Unfortunately, we do not support multi-molecule mol2 files yet, but SDF is supported.





A few usage examples:





Simple similarity search:





Code:
jcsearch -t:i -q query.mol structures.sdf






This one illustrates a dissimilarity threshold, a SMILES query, and piping the results to MarvinView:





Code:
jcsearch -t:i:0.2 -q 'CC1CCCN(C1)C(=O)CNC(N)=O' structures.sdf | mview '-' &
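To make the 0.2 in the command above more concrete: a dissimilarity threshold is usually read as "keep the targets whose dissimilarity to the query is at most this value". Below is a minimal Python sketch of that idea, assuming the Tanimoto metric on binary fingerprints stored as sets of on-bit positions. The fingerprints and function names are made up for illustration, and this is not how jcsearch is implemented internally.

```python
def tanimoto_similarity(a: set, b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 1.0  # convention chosen for this sketch: two empty fingerprints match
    return len(a & b) / union


def tanimoto_dissimilarity(a: set, b: set) -> float:
    """Dissimilarity is simply 1 minus similarity."""
    return 1.0 - tanimoto_similarity(a, b)


def is_hit(query: set, target: set, max_dissimilarity: float) -> bool:
    """A target is a hit if it is no more dissimilar than the threshold allows."""
    return tanimoto_dissimilarity(query, target) <= max_dissimilarity


# Hypothetical fingerprints (not produced by any ChemAxon tool)
query = {1, 4, 7, 9, 12}
close = {1, 4, 7, 9, 13}  # shares 4 of the 6 distinct on-bits with the query
far = {2, 3, 5}           # shares no on-bits with the query

for name, target in (("close", close), ("far", far)):
    d = tanimoto_dissimilarity(query, target)
    print(name, round(d, 3), is_hit(query, target, 0.2), is_hit(query, target, 0.4))
```

A smaller threshold is stricter: in this toy data the "close" structure is rejected at 0.2 but accepted at 0.4.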






Best regards,





Szabolcs

User 5f37dcad67

05-05-2005 06:43:59

Hi Szabolcs,





Thanks for the quick reply. I was able to run the search successfully. I have another question along the same lines. I tried searching with thresholds of 0.1, 0.2, 0.3, etc. How do I decide which threshold is good enough for me? What are the criteria for choosing a particular threshold?





Thanks in advance,





Regards,





Subbarao

ChemAxon efa1591b5a

06-05-2005 08:14:30

Hi Subbarao,





There is a tool in the Screen package that can be used to suggest reasonable threshold values.


This tool is a generic metric optimizer whose aim is to increase the enrichment ratio of virtual screening. The threshold is one of the many parameters it can optimize.
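For reference, the enrichment ratio is commonly defined as the fraction of actives among the retrieved hits divided by the fraction of actives in the whole screened set. Here is a minimal Python sketch under that assumption. The Screen tools may compute it differently, and the counts below are invented.

```python
def enrichment_ratio(active_hits: int, total_hits: int,
                     total_actives: int, total_molecules: int) -> float:
    """Hit rate among retrieved structures relative to the base rate of actives."""
    if total_hits <= 0 or total_actives <= 0 or total_molecules <= 0:
        raise ValueError("hit, active and molecule counts must all be positive")
    hit_rate = active_hits / total_hits          # actives among what was retrieved
    base_rate = total_actives / total_molecules  # actives in the whole set
    return hit_rate / base_rate


# Invented example: 10 actives hidden among 1000 molecules,
# and a search that returns 40 hits, 8 of which are actives.
print(enrichment_ratio(active_hits=8, total_hits=40,
                       total_actives=10, total_molecules=1000))
```

A value of 1 means the search does no better than picking molecules at random; higher is better.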





The optimizer is available as a command-line tool called optimizemetrics. To run it you need three input files. First, a random molecule set, for instance a subset of the target library in which you want to search for similar structures. Second, a set of structures that exhibit some sort of similarity to one another; this can be structural, pharmacophoric, biological, etc. You don't need many of them, though 20 or so is the required minimum. Split these 20 structures into two independent subsets, the training set and the query set; these are the second and third input files. Structures in the training set will be used as spikes, that is, they will be mixed into the random set. The aim of the optimization is to retrieve a given predefined percentage of these spikes.
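The split described above can be sketched in a few lines of Python. This is only an illustration: the structure identifiers are placeholders, the function names are hypothetical, and the 50/50 split is an assumption, since the post only says the two subsets must be independent. The spiking itself is presumably done by the tool from the files you pass to it; it is shown here just to make the idea concrete.

```python
import random


def split_actives(actives, seed=0):
    """Shuffle the known similar structures and split them into two disjoint halves."""
    shuffled = list(actives)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]  # (training set, query set)


def spike(random_set, training_set):
    """Mix the training-set structures (the spikes) into the random set."""
    return list(random_set) + list(training_set)


# Placeholder identifiers standing in for real structures
actives = [f"active_{i:02d}" for i in range(1, 21)]
random_set = [f"random_{i:03d}" for i in range(1, 201)]

training_set, query_set = split_actives(actives)
spiked_set = spike(random_set, training_set)
print(len(training_set), len(query_set), len(spiked_set))
```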





Then decide which molecular descriptor you want to use (I suppose this is ChemAxon's chemical fingerprint) and the appropriate similarity/dissimilarity metric (I reckon this could be Tanimoto in your case).





It is recommended to use a hypothesis fingerprint during optimization, rather than the individual fingerprints of the structures in the query set. You have three options here: Minimum, Median or Average. If your query set is not too diverse, use Minimum (this is often called consensus); otherwise I recommend Median.
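One way to picture the three hypothesis types is as a position-by-position combination of the query-set fingerprints. The Python sketch below assumes exactly that reading; the actual definitions used by the Screen package may differ, and the fingerprints are invented. For binary fingerprints, the position-wise minimum keeps only the bits set in every structure, which matches the "consensus" description.

```python
from statistics import median


def hypothesis(fingerprints, kind="Minimum"):
    """Combine equal-length fingerprints into a single hypothesis fingerprint."""
    columns = list(zip(*fingerprints))  # one tuple per fingerprint position
    if kind == "Minimum":
        return [min(col) for col in columns]
    if kind == "Median":
        return [median(col) for col in columns]
    if kind == "Average":
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unknown hypothesis type: {kind}")


# Three invented 6-bit fingerprints
query_fingerprints = [
    [1, 1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1, 1],
]

for kind in ("Minimum", "Median", "Average"):
    print(kind, hypothesis(query_fingerprints, kind))
```

In this toy data the Minimum hypothesis keeps only two bits, while Median keeps every bit set in a majority of the structures, which illustrates why Median is more forgiving for a diverse query set.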





Finally, you need to decide on the spike retrieval percentage; by default this is 80%.
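To see how a retrieval percentage translates into a threshold, here is a rough Python sketch: sort the dissimilarities between the query and the spikes, and take the smallest cutoff that still retrieves the requested share of them. This is an assumption about the general idea, not a description of what optimizemetrics does internally, and the dissimilarity values are made up.

```python
import math


def threshold_for_retrieval(spike_dissimilarities, percent=80):
    """Smallest dissimilarity cutoff that retrieves at least `percent` % of the spikes."""
    if not spike_dissimilarities:
        raise ValueError("at least one spike dissimilarity is required")
    ordered = sorted(spike_dissimilarities)
    needed = max(1, math.ceil(len(ordered) * percent / 100))  # spikes to retrieve
    return ordered[needed - 1]


# Made-up dissimilarities between a query hypothesis and 10 spikes
spikes = [0.12, 0.35, 0.20, 0.41, 0.28, 0.90, 0.33, 0.25, 0.47, 0.75]

print(threshold_for_retrieval(spikes, 80))
print(threshold_for_retrieval(spikes, 90))
```

Asking for a higher retrieval percentage can only loosen the threshold, so more of the random set will come through as well.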





Bearing all the above considerations in mind, the command you need to execute is as follows:





optimizemetrics randomset.smiles trainingset.smiles queryset.smiles -k CF -c cfp.xml -M Tanimoto -t -H





In this command line, -k CF tells the optimizer to use chemical fingerprints for the similarity calculation, and -c cfp.xml names the configuration file that specifies the parameters of the chemical fingerprint. You can find a sample cfp.xml file in the examples/config folder of your JChem installation directory. This is an XML file that you can edit with a simple text editor. You may need to edit it to make sure that the parameters are the same as in your similarity search.





-M Tanimoto specifies the metric, -t tells the optimizer to set the threshold only according to the spike retrieval percentage (80% by default), and -H specifies that a hypothesis of the query set is to be used. By default this is Minimum.





To play with some options, you can try this:





optimizemetrics randomset.smiles trainingset.smiles queryset.smiles -k CF -c cfp.xml -M Tanimoto -t -H Median -p 90





Here a median hypothesis is used and the spikes retrieval percentage is increased to 90%.





The above two commands dump their results (in XML) to standard output. Look for a line like this (towards the end of the output):





<ParametrizedMetric Name="queryset.smiles by Tanimoto" ActiveFamily="queryset.smiles" Metric="Tanimoto" Threshold="0.91">





The last value, Threshold, is what you need. Be aware that this is a dissimilarity threshold, so subtract it from 1 to get a similarity threshold.
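If you save the optimizer output to a file, picking out the threshold and converting it can be scripted. The Python sketch below uses a regular expression rather than an XML parser, because only this one line of the output format is shown here and the rest of the structure is unknown. The function names are hypothetical.

```python
import re

# Matches the Threshold attribute inside a ParametrizedMetric opening tag
THRESHOLD_PATTERN = re.compile(r'<ParametrizedMetric\b[^>]*\bThreshold="([^"]+)"')


def extract_thresholds(xml_text):
    """Return every dissimilarity threshold found in the optimizer output."""
    return [float(value) for value in THRESHOLD_PATTERN.findall(xml_text)]


def to_similarity(dissimilarity):
    """Convert a dissimilarity threshold to a similarity threshold."""
    return round(1.0 - dissimilarity, 6)  # rounding hides floating-point noise


sample_output = (
    '<ParametrizedMetric Name="queryset.smiles by Tanimoto" '
    'ActiveFamily="queryset.smiles" Metric="Tanimoto" Threshold="0.91">'
)

for d in extract_thresholds(sample_output):
    print("dissimilarity:", d, "similarity:", to_similarity(d))
```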





I hope this helps. I know my answer is quite long but optimizemetrics is a fairly complex piece of software. Please get back to the forum if something is not discussed appropriately or in case you have further questions.





Regards,


Miklos

User 5f37dcad67

06-05-2005 08:32:58

Dear Miklos,


Thank you so much for the detailed reply. I am going to use it straight away. If I have a problem I will surely come back to the forum.





Regards,





Subbarao

User 5f37dcad67

10-05-2005 13:46:15

Hi,


As you suggested, I ran optimizemetrics. I also learned that the command-line tool "optimize" tries all possible combinations of descriptors and metrics and reports the metric for which the enrichment and Active Hit Distribution are highest, so that the corresponding threshold and metric can be used. I was very confused when I saw my results, since I couldn't see any enrichment whatsoever. For your information, I am attaching the example-statistics.stat file to this reply.


The original library file has 84 molecules, from which I took 64 molecules as the training set and the remaining 20 molecules as the query set. Of these, I picked one molecule at random to pass to the command.





Please tell me what has to be done. Please let me know if I have done anything wrong.





Thanks in advance,





Regards,





Subbarao

ChemAxon efa1591b5a

10-05-2005 14:05:46

Hi Subbarao,





This is a known bug in hitstatistics, but there is a workaround. The memory-safe mode has to be used, that is, the -f flag needs to be specified on the command line. Please check the optimize script or batch file to make sure that hitstatistics is called with -f.





This -f flag was added to the optimize script in JChem 2.3.2.


Which version are you using?








Regards,


Miklos

User 5f37dcad67

10-05-2005 16:45:18

Dear Miklos,


Thank you for the reply. I am using JChem version 3.0.3. With the -f flag I am now getting a different result from the one I got earlier without it.





Thanks once again.





Subbarao

ChemAxon efa1591b5a

11-05-2005 08:11:39

Thanks for the info. I'll check what went wrong in that release: optimize and other related scripts and batch programs should call hitstatistics with the -f flag.


Regards,


Miklos

User 5f37dcad67

11-05-2005 09:08:31

Hi,


After running optimizemetrics for all the available metrics, I got the statistics file (attached to this reply), which showed that the PF descriptor with the Euclidean metric gave the highest enrichment. When I then ran jarp and looked at the results, there was 1 cluster with 2 neighbours. When I tried libmcs, it showed 5 distinct clusters. I am confused about what I should do and which of these is a good result...





Please help...





Regards,





Subbarao