User 0f28873a29
13-05-2008 00:02:19
Hi:
Does anybody have a good Standardizer XML configuration for canonicalizing the PubChem database?
I tried the default configuration, but it doesn't work.
I am trying to find duplicate compounds with these standardization options:
Clear Isotopes
Remove Fragments
Remove Explicit Hydrogens
Neutralize
Aromatize
Thanks in advance
ChemAxon d76e6e95eb
13-05-2008 07:38:21
The default configuration is just a basic example containing some important and some demonstrative actions. Standards usually vary from company to company, so there is no golden configuration applicable to all databases.
I uploaded a standardization file here containing transform actions for the most important mesomers. You should certainly add "Aromatize" and "Remove Explicit Hydrogens" actions as well.
Handling salts is a difficult issue: it is always a question whether you consider them duplicates or not. If you do consider them duplicates, then the "keep the largest fragment" option of the "Remove Fragments" action followed by "Neutralize" (as you correctly set them) seems a good approach in most cases. However, PubChem is so huge that you will always find special cases that do not conform to your rules.
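To illustrate the "keep the largest fragment" idea and how it feeds duplicate detection, here is a minimal Python sketch. It is not Standardizer code and does not use any ChemAxon API; the function names and record IDs are hypothetical. It assumes each record is a SMILES string with fragments separated by ".", estimates fragment size with a simplified atom-counting regex, and groups records whose largest fragments are textually identical. It does not neutralize, aromatize, or canonicalize anything, so a real workflow would still need a proper cheminformatics toolkit for those steps.

```python
import re
from collections import defaultdict

# Simplified SMILES atom tokens: bracket atoms, two-letter halogens,
# organic-subset atoms, and aromatic atoms. An assumption for illustration only.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def heavy_atom_count(fragment):
    """Roughly count non-hydrogen atoms in one SMILES fragment."""
    count = 0
    for token in ATOM_PATTERN.findall(fragment):
        if token.startswith("["):
            # Read the element symbol, skipping an optional isotope number.
            match = re.match(r"\[\d*([A-Za-z][a-z]?)", token)
            if match is None or match.group(1) == "H":
                continue  # unparsable bracket atom or explicit hydrogen
        count += 1
    return count


def keep_largest_fragment(smiles):
    """Return the fragment with the most heavy atoms (first one wins ties)."""
    return max(smiles.split("."), key=heavy_atom_count)


def find_duplicates(records):
    """Group record IDs by largest fragment; return groups with 2+ members."""
    groups = defaultdict(list)
    for record_id, smiles in records.items():
        groups[keep_largest_fragment(smiles)].append(record_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical records: an amine, its hydrochloride, and an unrelated alcohol.
records = {"cid_a": "CCN", "cid_b": "CCN.Cl", "cid_c": "CCO"}
duplicates = find_duplicates(records)
print(duplicates)
```

With these three records, the amine and its hydrochloride end up in the same group because the chloride fragment is discarded. Charged forms such as a carboxylate would not match their neutral acid here, which is why "Neutralize" follows "Remove Fragments" in the actual configuration.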
All in all, standardization will improve your search results. The configurations I uploaded here are included as templates in the newest Instant JChem release.
ChemAxon d76e6e95eb
01-05-2009 08:31:55
Configuration files including the mesomer transforms are available for download here.