How to register polymer-containing compounds

User 8139ea8dbd

24-10-2008 18:06:09

Marvin Sketcher can draw polymer-containing compounds, which is very nice. However, we cannot convert them to SMILES/SMARTS (extended JChem format). Our JChem cartridge index for our compound collection table is built on the SMILES column, so it seems it won't be able to handle such registration requests. Any suggestions or workarounds you can think of? Thanks.

ChemAxon a3d59b832c

26-10-2008 22:29:56

We are planning to handle polymers in extended SMILES format and in JChem databases from version 5.2 (expected in the first half of 2009).





Best regards,


Szabolcs

ChemAxon 9c0afc9aaf

28-10-2008 23:10:41

Hi,





An obvious workaround may be to switch to a format for registration that supports these compounds.





For both JChem and plain tables the format of the inserted structures may be mixed.





If one is indexing a JChem table with the cartridge, the reference to the cd_smiles column is just symbolic: it does not matter if it is NULL for some rows, the structure will still be found.





Szilard

User 8139ea8dbd

29-10-2008 05:15:36

Excellent suggestion!





Two questions for my own education: 1) So you are saying cd_smiles is actually not used at all by the cartridge, right? (I initially thought that's what is being cached in the structure search server.) 2) Does it mean the original structure column on which we build the index is the column that is being cached in the memory of the structure search server? If that's the case, in general it's still preferable to index SMILES rather than MOL, because the latter would increase the memory requirement of the structure search program. Is that right? (We say 1 million structures use roughly 100 MB of server memory; that estimate was made on the assumption that we are indexing SMILES strings, right?)

ChemAxon 9c0afc9aaf

29-10-2008 19:17:28

Hi,
Quote:
1) So you are saying cd_smiles is actually not used at all by the cartridge, right? (I initially thought that's what is being cached in the structure search server.)
When available, cd_smiles is indeed used for the search, and it is cached in memory.


However, to make sure we don't miss hits, if cd_smiles is NULL we fetch cd_structure and use that for the search.





Since cd_structure is not in the cache and has to be standardized on the fly, it is much slower this way. But if you only have a handful of these cases, the search speed should not degrade that much. (The slowdown can be more significant for proteins, though: their darker fingerprints make them hit as candidates more often, and standardization takes longer for them too.)
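
The fallback described above can be pictured with a small sketch. This is not the cartridge's actual code: the `Row` class, the `build_cache` and `search` functions, and the plain substring test standing in for a real substructure match are all hypothetical, chosen only to show the control flow (cached cd_smiles when present, otherwise fetch cd_structure and standardize it on the fly).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Tuple


@dataclass
class Row:
    """Hypothetical stand-in for one row of a structure table."""
    cd_id: int
    cd_smiles: Optional[str]  # None models a NULL cd_smiles
    cd_structure: str         # original structure source (e.g. MOL text)


def build_cache(rows: Iterable[Row]) -> Dict[int, str]:
    """Cache only rows that have a cd_smiles value; NULL rows are left out."""
    return {r.cd_id: r.cd_smiles for r in rows if r.cd_smiles is not None}


def search(
    rows: Iterable[Row],
    cache: Dict[int, str],
    matches: Callable[[str], bool],
    standardize: Callable[[str], str],
) -> Tuple[List[int], int]:
    """Return (hit ids, number of rows that took the slow path)."""
    hits: List[int] = []
    slow_path = 0
    for row in rows:
        target = cache.get(row.cd_id)
        if target is None:
            # cd_smiles was NULL: fetch the original structure and
            # standardize it on the fly (the slow path).
            slow_path += 1
            target = standardize(row.cd_structure)
        if matches(target):
            hits.append(row.cd_id)
    return hits, slow_path


# Toy data: a substring test is NOT real substructure matching,
# it only keeps the example self-contained.
rows = [
    Row(1, "CCO", "CCO"),
    Row(2, None, "  CCOC  "),   # e.g. a polymer with no SMILES form
    Row(3, "c1ccccc1", "c1ccccc1"),
]
cache = build_cache(rows)
hits, slow = search(rows, cache, lambda s: "CCO" in s, str.strip)
```

With only a handful of NULL rows the slow-path count stays small, which is why the overall search speed should not suffer much; the row without cd_smiles is still found.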


Quote:
2) Does it mean the original structure column on which we build the index is the column that is being cached in the memory of the structure search server? If that's the case, in general it's still preferable to index SMILES rather than MOL, because the latter would increase the memory requirement of the structure search program. Is that right? (We say 1 million structures use roughly 100 MB of server memory; that estimate was made on the assumption that we are indexing SMILES strings, right?)
It is always the cd_smiles column that is indexed if available (for JChem tables, or from the index table in the case of regular tables). It does not matter what type of column you put your index on: if cd_smiles is not available it will not be cached, so the memory requirement will not increase.





Best regards,





Szilard

ChemAxon 9c0afc9aaf

30-10-2008 15:13:21

Some further clarification:





One of my colleagues has informed me that the brackets of the repeating unit definition of these polymers are currently ignored during the search anyway (as if the unit were there only once), so you should consider this when planning a substructure search.








My colleague Peter explains caching in a more explicit way:
Quote:
The cd_structure column is not cached. It is always the cd_smiles column that is cached for jc_idxtype-indexes. If no extended smiles value is available for a given structure (cd_smiles IS NULL), only the fingerprints will be cached (which are cached anyway), and the memory requirement of the structure cache will actually decrease, rather than increase. In short: the penalty for structures not having a compact extended smiles representation is on the performance side, not on the memory side.
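
To make the memory point concrete, here is a rough back-of-the-envelope sketch. The 64-byte fingerprint size and the `estimate_cache_bytes` function are assumptions made for illustration only; the real fingerprint length depends on the table settings, and the real cache has overhead this model ignores. It only shows the direction of the effect: a NULL cd_smiles removes bytes from the cache, it never adds any.

```python
from typing import Iterable, Optional

# Assumption: 512-bit fingerprints. The real size is configurable per table.
FINGERPRINT_BYTES = 64


def estimate_cache_bytes(
    smiles_column: Iterable[Optional[str]],
    fingerprint_bytes: int = FINGERPRINT_BYTES,
) -> int:
    """Very rough cache size: fingerprints always, cd_smiles only if not NULL."""
    total = 0
    for smiles in smiles_column:
        total += fingerprint_bytes            # fingerprints are cached anyway
        if smiles is not None:
            total += len(smiles.encode("utf-8"))  # cached cd_smiles text
    return total


all_smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
with_null = ["CCO", None, "CC(=O)O"]  # middle row has no cd_smiles

full = estimate_cache_bytes(all_smiles)
reduced = estimate_cache_bytes(with_null)
```

In this toy model `reduced` is smaller than `full` by exactly the length of the missing SMILES string, which matches the statement that the penalty is on the performance side, not on the memory side.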
Here is some additional explanation of why cd_smiles can be NULL:





http://www.chemaxon.com/forum/viewpost403.html





Best regards,





Szilard