Is substructure search expected to be non-deterministic?

User 57295192cc

03-02-2014 15:34:46

Hi,


I have a question about the results returned by substructure search when the maximum number of results is limited and the search returns only a subset of all possible hits. Based on some testing, I found that the same substructure search doesn't always return exactly the same hits. All returned hits are correct, though, and ordered correctly. Is this behaviour expected?


(Apologies if this is something trivial -- I've checked the docs, and couldn't find anything obvious, but might have missed something...)


A simple Java example:


JChemSearch searcher = new JChemSearch();
searcher.setQueryStructure("N");
searcher.setConnectionHandler(getConnectionHandler());
searcher.setStructureTable(...);
JChemSearchOptions searchOptions = new JChemSearchOptions(SearchConstants.SUBSTRUCTURE);
searchOptions.setMaxResultCount(20);
searchOptions.setDissimilarityThreshold(.35f);
searcher.setSearchOptions(searchOptions);
searcher.run();
int[] results = searcher.getResults();
System.out.println("Results: " + Arrays.toString(results));


Running this twice in a row, I get results like this:


Results: [1, 2, 4, 7, 8, 10, 14, 29, 30, 31, 33, 34, 35, 42, 46, 48, 53, 56, 58, 116]
Results: [1, 2, 4, 7, 8, 10, 14, 29, 30, 31, 32, 34, 42, 46, 48, 53, 56, 58, 59, 116]


I also tried it with SIMILARITY and it seemed to be deterministic (although I haven't done exhaustive testing). I tried various search options, but they didn't make any difference. I also tried the Instant JChem demo and found the same: the results vary slightly each time.


I tested on JChem 6.2.0, but I have observed the same behaviour with older versions as well.


Many thanks,


Pal

ChemAxon d4fff15f08

04-02-2014 12:20:53

Hi Pal,


Your observation is correct! When the number of returned hits is limited (to x hits), we return the first x hits found (first in time) and then stop the search engine. Because of this, we cannot guarantee the same set of hits on every run, especially in a multi-threaded environment.


With similarity search, hits can be ordered by relevance (i.e. by similarity/dissimilarity score), so returning the first x structures is deterministic.
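To make the difference concrete, here is a minimal, self-contained Java sketch. It does not use the JChem API and is not how the search engine is implemented; the class FirstHitsDemo, the matches() predicate, the table size and the thread count are all hypothetical stand-ins. It only illustrates the two strategies described above: keeping whichever x hits arrive first from several threads, versus collecting every hit, ordering by a fixed key, and then truncating.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;

public class FirstHitsDemo {

    // Hypothetical stand-in for the substructure test: every third id "matches".
    static boolean matches(int id) {
        return id % 3 == 0;
    }

    // Early termination: each worker scans its own id range, and the search keeps
    // whichever 'limit' hits are reported first, so the set depends on thread timing.
    static int[] firstHitsInTime(int tableSize, int limit, int threads) {
        List<Integer> hits = new ArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int chunk = (tableSize + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            final int from = t * chunk + 1;
            final int to = Math.min(tableSize, from + chunk - 1);
            pool.submit(() -> {
                for (int id = from; id <= to; id++) {
                    if (!matches(id)) continue;
                    synchronized (hits) {
                        if (hits.size() >= limit) return; // limit reached: stop this worker
                        hits.add(id);
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("search interrupted", e);
        }
        Collections.sort(hits); // the returned hits are still correct and ordered
        return hits.stream().mapToInt(Integer::intValue).toArray();
    }

    // Exhaustive alternative: collect every hit, sort by a fixed key (the id here,
    // a similarity score in the similarity case), then keep the first 'limit'.
    static int[] firstHitsByOrder(int tableSize, int limit) {
        return IntStream.rangeClosed(1, tableSize)
                .filter(FirstHitsDemo::matches)
                .sorted()
                .limit(limit)
                .toArray();
    }

    public static void main(String[] args) {
        System.out.println("In time:  " + Arrays.toString(firstHitsInTime(3000, 20, 4)));
        System.out.println("In time:  " + Arrays.toString(firstHitsInTime(3000, 20, 4)));
        System.out.println("By order: " + Arrays.toString(firstHitsByOrder(3000, 20)));
    }
}
```

The two "In time" lines may or may not differ from run to run, depending on how the threads are scheduled, while the "By order" line is always the same. The price of the second approach is that every row has to be examined before anything is returned.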




Best regards,


Norbert

User 57295192cc

04-02-2014 12:26:30

Hi Norbert,


Thanks for the quick reply and the explanation. The behaviour makes perfect sense then!


Many thanks,


Pal