Duplicate checking and standardization

User 7f1dc8dfe3

26-09-2012 22:58:21

There are some major issues with duplicate identification that revolve around standardization. There are many different ways to draw valid renderings of some groups, the definition of aromaticity is a moving target, compounds exist in a number of different ionic forms with different counter-ions, compounds can have multiple tautomers, etc. The use of canonical SMILES presents particular problems if you do not keep to very careful standardization.

Often a task that is of interest to us is to look at compounds that are available from vendors and see which ones we already have and which are new. Since the vendor is responsible for creating the structure file, it is even more difficult to keep to business rules for structure rendering.

To some extent, you need to qualify what you mean by duplicates. If you have a salt with two different counter-ions, are they duplicates? If you have the same compound in different forms, or of different purity, are they duplicates? If you have tautomers or protomers, are they duplicates? And so on.

We are working on a system here that uses a series of 4 integers as a hash key to register compounds. A hash table returns a pointer to the primary key of any entry in the table that has the same value for all 4 integers. According to our research, these 4 ints are unique for any compound, and it doesn't matter what the rendering, aromaticity, or ionic state of the structure is; you still get the same values for the int keys. When a specific three of the 4 values match, the compounds are tautomers, which is also useful.
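A minimal sketch of the lookup described above, written in Python purely for illustration. The key values are placeholders, the class and method names are hypothetical, and since the post does not say which three of the four integers identify tautomers, the sketch assumes the first three.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[int, int, int, int]


class Registry:
    """In-memory stand-in for the registration hash table."""

    def __init__(self) -> None:
        # Full four-int key -> primary key of the registered compound.
        self._exact: Dict[Key, int] = {}
        # ASSUMPTION: the first three ints are the tautomer-level key.
        self._tautomer: Dict[Tuple[int, int, int], List[int]] = {}
        self._next_id = 1

    def lookup(self, key: Key) -> Optional[int]:
        """Return the primary key for an exact match, or None."""
        return self._exact.get(key)

    def tautomers_of(self, key: Key) -> List[int]:
        """Return primary keys that share the three-int key but are not an exact match."""
        exact = self._exact.get(key)
        return [pk for pk in self._tautomer.get(key[:3], []) if pk != exact]

    def register(self, key: Key) -> Tuple[int, bool]:
        """Return (primary_key, is_new); an existing compound is never re-added."""
        existing = self._exact.get(key)
        if existing is not None:
            return existing, False
        pk = self._next_id
        self._next_id += 1
        self._exact[key] = pk
        self._tautomer.setdefault(key[:3], []).append(pk)
        return pk, True
```

Both lookups are single dictionary operations, so the cost does not grow with the number of registered compounds.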

This is currently in sqlite databases, and I am in the process of seeing if it can be implemented in JChem. If you would like, I will keep you posted as to how this progresses. I don't know JChem well enough to know whether something similar has already been implemented, but suffice it to say it is not as straightforward a problem as it may seem.

LH_medchemist

ChemAxon a3d59b832c

27-09-2012 02:05:22

Hi,

Yes, you can just simply store your integers alongside the structures.

However, JChem also has a built-in way to do this. I suggest checking out this part of the manual:

http://www.chemaxon.com/jchem/doc/dev/dbconcepts/index.html#standardizerintegration

(This works for all search types, not just duplicate search.)

Furthermore, you can check out the search options that can be used to modify searching behaviour:

http://www.chemaxon.com/jchem/doc/user/query_searchoptions_index.html

Best regards,

Szabolcs

User 7f1dc8dfe3

27-09-2012 18:26:19

Szabolcs wrote:

Yes, you can just simply store your integers alongside the structures.

Thank you for the response.

The main question is about the hashing. It is one thing to be able to store the ints; it is another to be able to quickly check whether a structure you want to import has the same value for all four ints as a structure that is already registered in the database, especially when databases run into the hundreds of thousands of compounds, or millions. There are 55+ million structures available from Aldrich at present.

Using ruby sqlite, I can create a hash map based on the 4 ints. This lets me do a very quick single-operation lookup to see if there is already a primary key associated with the int values of a structure. The map takes the ints and returns a pointer to the primary key, or nothing if there is no key. If there is no primary key, I can add the structure; if there is a primary key, I can add data to the existing record, etc. The ints come from software that I already need to run anyway, so I am trying to take advantage of that and not re-do processing. Of course, the ruby code has no knowledge of chemistry, or how to display structures, export structure files, etc., so JChem has a lot to offer if the hashing method can be implemented reasonably.
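The register-or-update flow above can be sketched against sqlite directly. This is an illustration in Python rather than the ruby code itself, and the table names, column names (k1..k4), and function names are all hypothetical. Note that a UNIQUE constraint in SQLite is backed by a B-tree index rather than a true hash, so the lookup is logarithmic in table size instead of constant-time, though still a single indexed query.

```python
import sqlite3
from typing import Optional, Tuple

Key = Tuple[int, int, int, int]

# Hypothetical schema: one row per unique compound, plus per-vendor data.
SCHEMA = """
CREATE TABLE IF NOT EXISTS compound (
    id        INTEGER PRIMARY KEY,
    structure TEXT NOT NULL,
    k1 INTEGER NOT NULL,
    k2 INTEGER NOT NULL,
    k3 INTEGER NOT NULL,
    k4 INTEGER NOT NULL,
    UNIQUE (k1, k2, k3, k4)  -- creates the index used for duplicate checks
);
CREATE TABLE IF NOT EXISTS vendor_entry (
    compound_id INTEGER NOT NULL REFERENCES compound(id),
    vendor      TEXT NOT NULL
);
"""


def open_registry(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def find_compound(conn: sqlite3.Connection, key: Key) -> Optional[int]:
    """Return the primary key for the four ints, or None if unregistered."""
    row = conn.execute(
        "SELECT id FROM compound WHERE k1 = ? AND k2 = ? AND k3 = ? AND k4 = ?",
        key,
    ).fetchone()
    return row[0] if row else None


def register(conn: sqlite3.Connection, structure: str, key: Key,
             vendor: str) -> Tuple[int, bool]:
    """Add the structure if it is new; otherwise attach data to the existing record."""
    with conn:  # commit on success, roll back on error
        compound_id = find_compound(conn, key)
        is_new = compound_id is None
        if is_new:
            cur = conn.execute(
                "INSERT INTO compound (structure, k1, k2, k3, k4) "
                "VALUES (?, ?, ?, ?, ?)",
                (structure, *key),
            )
            compound_id = cur.lastrowid
        conn.execute(
            "INSERT INTO vendor_entry (compound_id, vendor) VALUES (?, ?)",
            (compound_id, vendor),
        )
    return compound_id, is_new
```

The four ints are passed in precomputed, matching the point that they come from software that has to run anyway.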

I will also look at the JChem method for doing this kind of thing that you posted about.

It seems odd that sqlite is not supported, since there are java drivers available. At minimum, I would expect JChem base to be able to display data from any database format for which drivers can be obtained. The rest is largely a mapping issue, and the user could do that by themselves if there were appropriate tools.

LH_medchemist

ChemAxon a3d59b832c

28-09-2012 15:23:50

Hi,

I think that hash-type indexes are available in other databases as well, so a similarly performant method should be available.

It seems odd that sqlite is not supported, since there are java drivers available.

We would be happiest if the different databases differed only in their JDBC drivers. :)

Unfortunately, supporting a new database type involves adjusting all the existing code to a new SQL dialect, accounting for slightly different column types, finding and working around performance bottlenecks, setting up matrix builds, testing different versions, etc.

In summary, it is not necessarily a small task.

At minimum, I would expect JChem base to be able to display data from any database format for which drivers can be obtained. The rest is largely a mapping issue, and the user could do that by themselves if there were appropriate tools.

Displaying the data works the same way as for any other database engine:

1. Use JDBC to retrieve the source of the molecule.

2. Pass it to Marvin applets/beans or create a Molecule object and generate an image.

See the Marvin developers guide for more details on this latter step:

http://www.chemaxon.com/marvin/help/developer/index.html

Best regards,

Szabolcs