Molecule.equals()

User b7aa615db3

22-07-2005 00:04:50

According to the javadoc and my observation of its use in collections, Molecule uses Object's equals method.





Would it make more sense for Molecule to implement this function?





In particular, I'm trying to use a Set to collect unique molecules, but I'm getting duplicates. Any suggestions besides creating my own subclass?

ChemAxon 9c0afc9aaf

22-07-2005 09:24:27

Hi,





This is not a trivial issue, for the following reasons:





1. The MolSearch class that performs graph searches is only available in JChem, while the Molecule class is also present in the Marvin API





2. To perform a graph search you must have the molecules in a standardized form (otherwise the search will not work correctly).





2.a: A Molecule does not know whether it has been standardized correctly, so both structures would have to be cloned and standardized on every comparison, which wastes a lot of CPU time.





2.b: The required form of standardization may vary from user to user.





3. Some of our users may already rely on equals() in its present form (which is currently equivalent to ==). Changing its behavior would cause problems for them.
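To illustrate what "equivalent to ==" means for a Set, here is a minimal sketch that does not use any ChemAxon classes. The Structure class is a hypothetical stand-in for any class that inherits Object's equals() and hashCode(); it is an assumption made for this example, not the real Molecule.

```java
import java.util.HashSet;
import java.util.Set;

class IdentityEqualsDemo {
    /** Hypothetical stand-in for a class that does not override equals()/hashCode(). */
    static final class Structure {
        final String smiles;

        Structure(String smiles) {
            this.smiles = smiles;
        }
    }

    /** Adds two separately created objects with the same content and returns the set size. */
    static int countAfterAddingTwice(String smiles) {
        Set<Structure> set = new HashSet<>();
        set.add(new Structure(smiles));
        set.add(new Structure(smiles)); // different object, so identity-based equals() says "not equal"
        return set.size();
    }

    public static void main(String[] args) {
        // Both objects are kept, because the set only compares object identity.
        System.out.println(countAfterAddingTwice("c1ccccc1")); // prints 2
    }
}
```

This is the behavior reported in the question: two objects with identical content both end up in the Set, because nothing tells the collection how to compare them by content.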





Writing a subclass and overriding equals() is probably not the best solution either, since you would run into some of the same problems mentioned above.





Writing some code for collecting unique molecules would be the most effective solution.





I recommend using the following API:





1. You have to standardize the molecules to have correct search results.


The most important standardization step is the detection of aromatic rings.


You can simply use Molecule.aromatize() for this purpose:





http://www.chemaxon.com/jchem/doc/api/chemaxon/struc/Molecule.html#aromatize(int)





For more sophisticated standardization (e.g. bringing certain functional groups to the same form) you will need Standardizer:





http://www.chemaxon.com/jchem/doc/api/chemaxon/reaction/Standardizer.html





Please see the following page for more details:


http://www.chemaxon.com/jchem/doc/user/Standardizer.html








2. You can perform graph search on the standardized molecules.


Use MolSearch for this:





http://www.chemaxon.com/jchem/doc/api/chemaxon/sss/search/MolSearch.html





To detect duplicates, the PERFECT search type should be set:





http://www.chemaxon.com/jchem/doc/api/chemaxon/sss/search/Search.html#setSearchType(int)





3. If you typically search each new entry against a lot of structures, you can speed up the duplicate filtering by calculating and storing a hash code for each structure.





http://www.chemaxon.com/jchem/doc/api/chemaxon/sss/screen/HashCode.html





- If the hash code differs for two structures, they are certainly not equal.


- If the hash code is the same, the structures may be equal; a graph search must be performed by MolSearch to know for sure.





Although calculating the hash code also takes some time, you can filter out almost all the non-matching structures with a simple and fast integer comparison, greatly reducing the number of MolSearch calls.
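The three steps above (standardize once, pre-filter by hash code, confirm with an exact search) can be sketched as a small collector class. This is only an illustration under stated assumptions: UniqueCollector and its three function parameters are hypothetical names, and the sketch deliberately contains no ChemAxon calls. In real code the standardizer would wrap aromatize() or Standardizer, the hasher would wrap HashCode, and the exact match would wrap a MolSearch in PERFECT mode; plain strings stand in for structures here so that the example runs on its own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;
import java.util.function.ToIntFunction;
import java.util.function.UnaryOperator;

/** Hypothetical helper: keeps only unique items, using a hash as a cheap pre-filter. */
class UniqueCollector<T> {
    private final UnaryOperator<T> standardizer;  // assumed: returns a standardized copy
    private final ToIntFunction<T> hasher;        // assumed: equal items give equal hashes
    private final BiPredicate<T, T> exactMatch;   // assumed: the expensive exact comparison
    private final Map<Integer, List<T>> buckets = new HashMap<>();
    private int size = 0;
    private int exactComparisons = 0;

    UniqueCollector(UnaryOperator<T> standardizer,
                    ToIntFunction<T> hasher,
                    BiPredicate<T, T> exactMatch) {
        this.standardizer = standardizer;
        this.hasher = hasher;
        this.exactMatch = exactMatch;
    }

    /** Returns true if the item was new and has been stored, false if it was a duplicate. */
    boolean add(T item) {
        // Step 1: standardize once, on entry, instead of on every comparison.
        T standardized = standardizer.apply(item);
        // Step 3: only items with the same hash can possibly be equal.
        int hash = hasher.applyAsInt(standardized);
        List<T> bucket = buckets.computeIfAbsent(hash, k -> new ArrayList<>());
        // Step 2: run the exact comparison only against items in the same bucket.
        for (T existing : bucket) {
            exactComparisons++;
            if (exactMatch.test(existing, standardized)) {
                return false;
            }
        }
        bucket.add(standardized);
        size++;
        return true;
    }

    int size() {
        return size;
    }

    /** Number of expensive exact comparisons performed so far. */
    int exactComparisons() {
        return exactComparisons;
    }

    public static void main(String[] args) {
        // Toy stand-ins: trim() as "standardization", length() as "hash code",
        // equals() as "perfect search".
        UniqueCollector<String> unique =
                new UniqueCollector<>(String::trim, String::length, String::equals);
        unique.add("CCO");
        unique.add(" CCO ");     // duplicate after standardization
        unique.add("CCN");       // same hash as CCO, rejected only by the exact comparison
        unique.add("c1ccccc1");  // different hash, no exact comparison needed
        System.out.println(unique.size() + " unique, "
                + unique.exactComparisons() + " exact comparisons");
    }
}
```

The stored items are the standardized copies, so each structure is standardized and hashed exactly once, and the expensive comparison runs only when two hash codes collide.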





Best regards,





Szilard