I am using the JChemSearch class to do my similarity searches, and I would like to see the top 100 results. Now that the searcher has deprecated setSimilarityThreshold, I have to use setDissimilarityThreshold. The issue is this: when I get a result set of about 100 hits for a dissimilarity threshold of 0.3, the searcher orders them starting at 0.3 and going down. Translated into a similarity search (similarity threshold of 0.7), this would mean that the searcher gives me 100 compounds starting at 0.7 and going up. Instead, I'd like to see the results ordered from 1.0 downwards, but not below 0.7.
So if there were about 500 compounds with a similarity of at least 0.7, with 100 between 1.0 and 0.9, 200 between 0.9 and 0.8 and 300 between 0.8 and 0.7, then I get the last 100, the ones closest to 0.7. However, I am expecting to see the 100 between 1.0 and 0.9.
Is there a way to reverse this order, so that I get the most similar compounds in my result set instead of the least similar ones when I restrict my max hits value to something less than unlimited?
The hits are always returned in increasing order of dissimilarity.
This means that the most similar compounds of the hit list will appear first.
If you specify a max hit count, the searcher may collect enough hits and stop even before processing the whole table.
So you will see only those structures that have been found so far.
Of course, the structures that were found will still appear in the correct order.
If you want the results to start with the most similar structure from the whole table, you must always set the maximum result count to unlimited.
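To illustrate why a limited search can miss the best matches, here is a minimal sketch in plain Java. It is a toy model of the behavior described above, not JChem code: the class `EarlyStopDemo`, the method `search`, and the convention that `maxHits <= 0` means unlimited are all made up for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model of a threshold search with an optional hit limit (hypothetical, not JChem). */
final class EarlyStopDemo {

    /**
     * Scans the table in row order, keeps rows whose dissimilarity is at or
     * below the threshold, and stops as soon as maxHits rows are collected
     * (maxHits <= 0 means unlimited in this sketch). The collected row
     * indices are then sorted by increasing dissimilarity.
     */
    static List<Integer> search(double[] dissimilarity, double threshold, int maxHits) {
        List<Integer> hits = new ArrayList<>();
        for (int row = 0; row < dissimilarity.length; row++) {
            if (dissimilarity[row] <= threshold) {
                hits.add(row);
                if (maxHits > 0 && hits.size() == maxHits) {
                    break; // early exit: the remaining rows are never examined
                }
            }
        }
        hits.sort(Comparator.comparingDouble((Integer row) -> dissimilarity[row]));
        return hits;
    }

    public static void main(String[] args) {
        // Dissimilarity of each table row to the query; the best matches come late.
        double[] table = {0.28, 0.50, 0.25, 0.05, 0.10};

        // Limited search: stops after rows 0 and 2, so rows 3 and 4 are missed.
        System.out.println(search(table, 0.3, 2)); // prints [2, 0]

        // Unlimited search, then keep the first two: the true top hits.
        List<Integer> all = search(table, 0.3, 0);
        System.out.println(all.subList(0, Math.min(2, all.size()))); // prints [3, 4]
    }
}
```

Both result lists are correctly ordered, but only the unlimited search followed by truncation contains the two most similar rows.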
Hmm.. isn't that a paradox? So, if I wanted to see the top 100 similar compounds in a set of 500 hits, I wouldn't be able to see them by restricting my hits to 100. My understanding of restricting max hits to a certain value (say 100) is to give the top 100 hits.
I admit that it is probably more logical to always return the top hits; we may modify our code accordingly in the future.
This change would mean that the whole table has to be processed for every similarity search.
This is not a problem for normal similarity search (since it is very fast), but it can increase the expected search time for Molecular Descriptor (e.g. pharmacophore) similarity.
We will examine whether this change can cause any problems in existing JChem based systems.
We have decided to implement this modification starting with JChem 3.1 (the next major release).
Thanks, Szilard. We appreciate your thoughts.
I am trying to get the similarity scores associated with the list returned by a JChemSearch. Let's assume I got 100 compounds. If I do the following,
float[] scores = searcher.getDissimilarity();
does the array "scores" contain the 100 scores?
Yes, the call will return the dissimilarity values.
The size and order of the array will match those of the result array, so you will indeed get 100 dissimilarity scores.
Please be aware that if you need the 100 most similar cd_id values and scores from a structure table, you have to leave the maximum result count at unlimited (the default) and use the first 100 values from the larger result set.
If you specify 100 for maxResultCount, the search will stop after finding 100 compounds that are similar enough to satisfy the specified threshold. These may not be the most similar ones.
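As a rough sketch of the "use the first 100 values" step, the helper below trims two parallel arrays to the first n entries and converts dissimilarity to similarity as 1 - dissimilarity (matching the 0.3 / 0.7 correspondence mentioned earlier in this thread). The class `TopHits` and its methods are hypothetical names for this example and make no JChem calls; the input arrays are assumed to come from an unlimited search and to be ordered by increasing dissimilarity already.

```java
import java.util.Arrays;

/** Hypothetical helper for trimming an unlimited, already ordered result set. */
final class TopHits {

    /** Returns the first n ids, or all of them if fewer than n are available. */
    static int[] firstIds(int[] cdIds, int n) {
        return Arrays.copyOf(cdIds, Math.min(n, cdIds.length));
    }

    /** Returns the first n scores converted to similarity (1 - dissimilarity). */
    static float[] firstSimilarities(float[] dissimilarity, int n) {
        int count = Math.min(n, dissimilarity.length);
        float[] similarity = new float[count];
        for (int i = 0; i < count; i++) {
            similarity[i] = 1.0f - dissimilarity[i];
        }
        return similarity;
    }

    public static void main(String[] args) {
        // Stand-ins for the arrays returned by an unlimited search.
        int[] cdIds = {42, 7, 19, 3};
        float[] dissimilarity = {0.0f, 0.25f, 0.5f, 0.5f};

        System.out.println(Arrays.toString(firstIds(cdIds, 2)));
        System.out.println(Arrays.toString(firstSimilarities(dissimilarity, 2)));
    }
}
```

Because both arrays share the same order, position i in the trimmed id array still corresponds to position i in the trimmed score array.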
In the discussion above I stated that this behavior would change from JChem 3.1.
After a second review we have found that the mentioned change could indeed cause unexpected problems in existing programs based on JChem, so we will not change the behavior after all.
Instead, we will document the proper usage more clearly, so that it does not cause confusion.
Thanks, Szilard. I will keep your suggestion in mind while using the function.