similarity Threshold - ChemAxon Forum Archive

User e34a92cce5

18-03-2005 17:19:57

Hi,

I am using the JChemSearch class to run my similarity searches, and I would like to see the top 100 results. Since setSimilarityThreshold has been deprecated in the searcher, I have to use setDissimilarityThreshold instead. The issue is this: when I get a result set of about 100 hits for a dissimilarity threshold of 0.3, the searcher orders them starting at 0.3 and going down. Translated into similarity terms (a similarity threshold of 0.7), this means that the searcher gives me 100 compounds ordered from 0.7 upwards. Instead, I would like to see the results ordered from 1.0 downwards, but not below 0.7.

So if there were about 500 compounds with a similarity of at least 0.7, with 100 between 1.0 and 0.9, 200 between 0.9 and 0.8 and 300 between 0.8 and 0.7, then I get the last 100, the ones closest to 0.7. However, I am expecting to see the 100 between 1.0 and 0.9.

Is there a way to reverse this order, so that I get the most similar compounds in my result set instead of the least similar ones when I restrict the max hits value to something less than unlimited?

ChemAxon 9c0afc9aaf

22-03-2005 18:42:47

Hi,

The hits are always returned in increasing order of dissimilarity.

This means that the most similar compounds of the hit list will appear first.

If you specify a max hit count, the searcher may collect enough hits and stop even before processing the whole table.

So you will see only those structures that have been found so far.

Of course they will appear in correct order.

If you want the results to start with the most similar structure from the whole table, you must always set the maximum result count to unlimited.
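The behavior described above can be reproduced with a toy model. The following is a minimal sketch in plain Java, not JChem code: the class EarlyStopDemo, the Hit type and both methods are hypothetical names chosen for illustration, and the threshold is assumed to be inclusive. It contrasts a search that stops once the hit limit is reached with an unlimited search whose first n hits are kept.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a thresholded search with a hit limit; not JChem code. */
class EarlyStopDemo {

    /** One table row: a hypothetical id and its dissimilarity to the query. */
    static final class Hit {
        final int id;
        final float dissimilarity;

        Hit(int id, float dissimilarity) {
            this.id = id;
            this.dissimilarity = dissimilarity;
        }
    }

    /**
     * Scans rows in table order, keeps those at or below the threshold
     * (inclusive by assumption), and stops as soon as maxHits rows are
     * collected. A maxHits value of 0 or less means unlimited.
     * The collected hits are returned in increasing order of dissimilarity.
     */
    static List<Hit> search(List<Hit> table, float threshold, int maxHits) {
        List<Hit> hits = new ArrayList<>();
        for (Hit row : table) {
            if (row.dissimilarity <= threshold) {
                hits.add(row);
                if (maxHits > 0 && hits.size() == maxHits) {
                    break; // early stop: later rows are never examined
                }
            }
        }
        hits.sort((a, b) -> Float.compare(a.dissimilarity, b.dissimilarity));
        return hits;
    }

    /** Runs an unlimited search, then keeps only the first n (most similar) hits. */
    static List<Hit> topN(List<Hit> table, float threshold, int n) {
        List<Hit> all = search(table, threshold, 0);
        return new ArrayList<>(all.subList(0, Math.min(Math.max(n, 0), all.size())));
    }
}
```

For a table whose rows have dissimilarities 0.28, 0.50, 0.25, 0.05 and 0.00 in that order, search(table, 0.3f, 2) returns the rows with 0.25 and 0.28 (the first two matches found, sorted), whereas topN(table, 0.3f, 2) returns the rows with 0.00 and 0.05.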

Best regards,

Szilard

User e34a92cce5

22-03-2005 18:59:38

Hmm.. isn't that a paradox? So, if I wanted to see the 100 most similar compounds in a set of 500 hits, I wouldn't be able to see them by restricting my hits to 100. My understanding is that restricting max hits to a certain value (say 100) should give the top 100 hits.

ChemAxon 9c0afc9aaf

22-03-2005 19:21:36

Hi,

I admit that it is probably more logical to always give back the top hits; maybe we will modify our code accordingly in the future.

This change would mean that the whole table is processed for every similarity search.

This is not a problem for normal similarity search (since it's very fast), but it can increase the expected search time for Molecular Descriptor (e.g. pharmacophore) similarity.
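To illustrate why returning the true top hits requires looking at every row, here is a minimal sketch of a single-pass top-n selection with a bounded heap. It is plain Java and only an illustration under stated assumptions, not a description of how JChem is implemented: the class TopNScan and the method nSmallest are hypothetical, and the threshold is assumed to be inclusive.

```java
import java.util.PriorityQueue;

/** Illustrative top-n selection over dissimilarity values; not JChem code. */
class TopNScan {

    /**
     * Returns the n smallest dissimilarity values that are at or below the
     * threshold, in increasing order. Every element of the input is examined,
     * because a better hit could still appear in the last row.
     */
    static float[] nSmallest(float[] dissimilarities, float threshold, int n) {
        if (n <= 0) {
            return new float[0];
        }
        // Max-heap: the head is the worst (largest) value currently kept.
        PriorityQueue<Float> heap = new PriorityQueue<>(n, (a, b) -> Float.compare(b, a));
        for (float d : dissimilarities) {
            if (d > threshold) {
                continue; // not similar enough
            }
            if (heap.size() < n) {
                heap.add(d);
            } else if (d < heap.peek()) {
                heap.poll(); // drop the current worst hit
                heap.add(d);
            }
        }
        float[] out = new float[heap.size()];
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = heap.poll(); // largest comes out first, so fill from the back
        }
        return out;
    }
}
```

The memory use stays bounded by n, but the loop cannot stop early, which is the cost referred to above when each comparison is expensive.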

We will examine whether this change can cause any problems in existing JChem based systems.

Szilard

ChemAxon 9c0afc9aaf

23-03-2005 11:53:57

Hi,

We have decided to implement the modification mentioned above starting with JChem 3.1 (the next major release).

Best regards,

Szilard

User e34a92cce5

23-03-2005 14:12:48

Thanks Szilard. We appreciate your thoughts

Renju

ChemAxon 587f88acea

18-08-2005 15:58:24

Hi, there.

I am trying to get the similarity scores associated with the list returned by a JChemSearch. Let's assume I got 100 compounds. If I do the following:

float [] scores = searcher.getDissimilarity();

Does the array "scores" contain the 100 scores?

Thanks,

Donald

ChemAxon 9c0afc9aaf

18-08-2005 16:16:21

Hi,

Yes, the call will return the dissimilarity values.

The size and the order of the array will match those of the result array, so you will indeed get 100 dissimilarity scores.

Please be aware that if you need the 100 most similar cd_id values and scores from a structure table, you have to set the maximum result count to unlimited (the default) and use the first 100 values from the larger result set.

If you specify 100 for maxResultCount, the search will stop after finding 100 compounds that are similar enough to satisfy the specified similarity threshold; these may not be the most similar ones.
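Taking the first 100 values from the larger result set can be sketched as follows. This is a minimal illustration in plain Java that works on two parallel arrays, assumed to be the cd_id results and the matching dissimilarity scores of an unlimited search, already in increasing order of dissimilarity. The class FirstHits and its methods are hypothetical helpers, not part of JChem, and the conversion assumes similarity = 1 - dissimilarity, as implied by the 0.3 / 0.7 thresholds discussed earlier in this thread.

```java
import java.util.Arrays;

/** Hypothetical helpers for trimming result arrays; not part of JChem. */
class FirstHits {

    /** Keeps the first n cd_id values (fewer if the array is shorter). */
    static int[] firstIds(int[] cdIds, int n) {
        return Arrays.copyOf(cdIds, Math.max(0, Math.min(n, cdIds.length)));
    }

    /** Keeps the first n dissimilarity scores (fewer if the array is shorter). */
    static float[] firstScores(float[] dissimilarity, int n) {
        return Arrays.copyOf(dissimilarity, Math.max(0, Math.min(n, dissimilarity.length)));
    }

    /** Converts dissimilarity scores to similarity, assuming similarity = 1 - dissimilarity. */
    static float[] toSimilarity(float[] dissimilarity) {
        float[] similarity = new float[dissimilarity.length];
        for (int i = 0; i < dissimilarity.length; i++) {
            similarity[i] = 1.0f - dissimilarity[i];
        }
        return similarity;
    }
}
```

Because both arrays share the same order, trimming each of them to the same n keeps every cd_id aligned with its score.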

In the discussion above I have stated that this behavior will change from JChem 3.1.

After a second review we have found that the mentioned change could indeed cause unexpected problems in existing programs based on JChem, so we will not change the behavior after all.

We will document the proper usage better, so that it does not cause problems.

Best regards,

Szilard