How to identify duplicate structures?

User 52a4e280f0

10-09-2012 12:25:27

Hi there,


Is there a way to identify, among the structures already loaded in our relational database, those that would be considered either duplicates or orphans once some new standardization rules have been applied?


Many thanks,


Isha

ChemAxon 60ee1f1328

10-09-2012 12:53:41

You can use the duplicate check tick box at entity creation time; this should prevent any duplicate row from being created...


If you apply a standardizer rule which you think could create duplicates in CD_SMILES, then I think you would need to write a query to check whether this is indeed the case, such as the one below. If it returns any rows at all, each of them represents some duplication. If you have used a cartridge index, I don't think this should be possible, since the index will have the duplicate flag set. Hopefully you would see the output of such an attempt in a log...


SELECT cd_smiles,count(cd_id)

FROM jchemtable

GROUP BY cd_smiles

HAVING COUNT(cd_id) > 1
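
As a rough illustration of what the query returns, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory stand-in for the structure table. The table and column names are taken from the query above; the rows are invented for illustration only, and a real JChem table lives in your own database and has many more columns.

```python
import sqlite3

# In-memory stand-in for a JChem structure table (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jchemtable (cd_id INTEGER PRIMARY KEY, cd_smiles TEXT)")
conn.executemany(
    "INSERT INTO jchemtable (cd_id, cd_smiles) VALUES (?, ?)",
    [(1, "CCO"), (2, "c1ccccc1"), (3, "CCO"), (4, "CC(O)=O")],
)

# Same query as above: every returned row is a CD_SMILES value that is
# shared by more than one cd_id.
duplicate_rows = conn.execute(
    """
    SELECT cd_smiles, COUNT(cd_id)
    FROM jchemtable
    GROUP BY cd_smiles
    HAVING COUNT(cd_id) > 1
    """
).fetchall()

for smiles, count in duplicate_rows:
    print(smiles, count)
```

Only the CD_SMILES values that occur more than once come back, each with the number of rows sharing it.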


User 52a4e280f0

10-09-2012 13:55:09

Thanks Daniel for your quick reply...


If I run this SQL query on my relational database, I get dozens of rows, each with a count of 2.


Does this mean that these are the duplicate structures created while loading with the old standardization rules?


Can we simply delete them and try reloading them with the new standardization rules?


OLD:



<StandardizerConfiguration>


   <Actions>


    <Aromatize ID="Aromatize" Type="general"/>


   <Transformation ID="PlusMinus" Structure="[*+:1]-[*-:2]&gt;&gt;[*:1]=[*:2]"/>


   <Transformation ID="PlusMinusDouble" Structure="[*+:1]=[*-:2]&gt;&gt;[*:1]#[*:2]"/>


   <Transformation ID="nitro" Structure="O=[N:1]=[O:2]&gt;&gt;[O-:2][N+:1]=O |(7.26,,;5.72,,;4.18,,;15.5,.2,;16.83,-.57,;18.16,.2,)|"/>


   <Transformation ID="azide" Structure="N=[N:1]#[N:2]&gt;&gt;N=[N+:1]=[N-:2]"/>


   <Transformation ID="enamine" Structure="[H:4][N:3][C:1]=[C:2]&gt;&gt;[H:4][C:2][C:1]=[N:3]"/>


   <Transformation ID="enol" Structure="[H:4][O:3][C:1]=[C:2]&gt;&gt;[H:4][C:2][C:1]=[O:3]"/>


   <Transformation ID="ammoniumhalide" Structure="[F,Cl,Br,I-:3].[H:2][N+:1][#6]&gt;&gt;[F,Cl,Br,I:3][H:2].[#6][N:1]"/>


   <Transformation ID="sulphate" Structure="[#6:1]([OH1:5])(=[O:3])=[O:4]&gt;&gt;[#6:1]([O-:5])(=[O:3])=[O:4]"/>


   <Transformation ID="carboxylate" Structure="[#6:1][C:2]([OH1:4])=[O:3]&gt;&gt;[#6:1][C:2]([O-:4])=[O:3]"/>  


   <RemoveExplicitH ID="RemoveExplicitH" Groups="target"/>


   <Sgroups ID="Ungroup" Act="Ungroup"/>


  </Actions>


</StandardizerConfiguration>



 


NEW:


<StandardizerConfiguration Version='1.0'>


   <Actions>


      <Aromatize ID='Aromatize' Type='general'/>


      <Transformation ID='PlusMinus' Structure='[*+:1]-[*-:2]&gt;&gt;[*:1]=[*:2]'/>


      <Transformation ID='PlusMinusDouble' Structure='[*+:1]=[*-:2]&gt;&gt;[*:1]#[*:2]'/>                      


      <Transformation ID='enamine' Structure='[H:4][N:3][C:1]=[C:2]&gt;&gt;[H:4][C:2][C:1]=[N:3]'/>


      <Transformation ID='enol' Structure='[H:4][O:3][C:1]=[C:2]&gt;&gt;[H:4][C:2][C:1]=[O:3]'/>


      <Transformation ID='ynol' Structure='[H:4][O:3][C:1]#[C:2]&gt;&gt;[H:4][C:2]=[C:1]=[O:3]'/>


      <Transformation ID='nitroso' Structure='[H:4][C:3][N:1]=[O:2]&gt;&gt;[H:5][O:2][N:1]=[C:3]'/>


      <Transformation ID='alcoholate' Structure='[#6][O:1][Na,K:2]&gt;&gt;[#6][O-:1].[Na,K;+:2]'/>


      <Transformation ID='ammoniumhalide' Structure='[F,Cl,Br,I;-:3].[H:2][N+:1][#6]&gt;&gt;[F,Cl,Br,I:3][H:2].[#6][N:1]'/>


      <Transformation ID='sulphate' Structure='[#6:1]([O-:5])(=[O:3])=[O:4]&gt;&gt;[#6:1]([OH1:5])(=[O:3])=[O:4]'/>


      <RemoveExplicitH ID='RemoveExplicitH' Groups='target'/>


      <Neutralize ID='Neutralize'/>


    </Actions>


</StandardizerConfiguration>


 

ChemAxon fa971619eb

10-09-2012 15:13:44

Using the value of CD_SMILES may not give you the exact answer. There are a few cases I can think of where this might not work.


I think the only really safe way is to run a duplicate search for every structure, e.g. read the value of the CD_STRUCTURE column and do a duplicate structure search with it. It will of course find itself, but where there is a duplicate, that will also be found.


It should be quite simple to write some Java that does this.


Tim


 

ChemAxon a3d59b832c

10-09-2012 19:33:26

Hi Isha,


 


Tim is correct. You can take each structure, and do a duplicate search on the table again.


If it finds only itself, then it is an orphan record. If it finds more, then it is a duplicate. (Or multiplicate...)
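
The rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `duplicate_search` is a hypothetical callable standing in for a real JChem duplicate search (it takes a structure and returns the matching cd_ids), and the stand-in used in the example simply compares strings, which a real duplicate search does not do.

```python
from typing import Callable, Dict, List, Tuple


def classify_records(
    structures: Dict[int, str],
    duplicate_search: Callable[[str], List[int]],
) -> Tuple[List[int], List[List[int]]]:
    """Split cd_ids into orphans and groups of mutual duplicates."""
    orphans: List[int] = []
    groups: Dict[Tuple[int, ...], List[int]] = {}
    for cd_id, structure in structures.items():
        hits = sorted(duplicate_search(structure))
        if len(hits) <= 1:
            # The search found only the record itself: an orphan.
            orphans.append(cd_id)
        else:
            # More than one hit: remember the group once, keyed by its ids.
            groups[tuple(hits)] = hits
    return sorted(orphans), sorted(groups.values())


# Hypothetical stand-in for the real search: exact string comparison
# against an invented three-row table.
table = {1: "CCO", 2: "c1ccccc1", 3: "CCO"}


def fake_duplicate_search(structure: str) -> List[int]:
    return [cd_id for cd_id, stored in table.items() if stored == structure]


orphans, duplicate_groups = classify_records(table, fake_duplicate_search)
print("orphans:", orphans)
print("duplicate groups:", duplicate_groups)
```

Keying the groups by their sorted ids means each set of duplicates is reported once, even though every member of the set finds the others.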


 


If you would only like to de-duplicate the table, an alternative approach is the following:


1. Export the whole table.


2. Create a temporary table with the new standardizer rules and duplicate filtering.


3. Import the previously exported file into the temporary table.


 


As a result, you will get the new table without duplicates - according to the new Standardizer rules.


 


Let us know if you need any further help.


Best regards,


Szabolcs

User 52a4e280f0

11-09-2012 10:30:11

Thanks for these options, Tim and Szabolcs. I will try them.


Regards,


Isha 

User 52a4e280f0

11-09-2012 14:58:28

Thanks guys, with this approach we were able to figure out the structures which were considered as duplicates with the new standardization rules.


I am now trying to find the structures which would be split into two or more substructures. Is there a way to figure out these as well?


Many thanks,


Isha

ChemAxon a3d59b832c

11-09-2012 15:50:37

Hi Isha,


 


If I understand the question correctly, I think you can use a Chemical Terms column with expression "fragmentCount()".


If it is more than one, then it is a multi-fragment molecule.
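
For a quick check outside the database, a rough stand-in can count the dot-separated parts of a SMILES string. This is not ChemAxon's fragmentCount() function, only a sketch that assumes plain dot-disconnected SMILES; it ignores the rare case where ring-closure digits bond two parts across a dot. The function name is hypothetical.

```python
def count_smiles_fragments(smiles: str) -> int:
    """Approximate the number of disconnected fragments in a SMILES string."""
    parts = smiles.split()
    if not parts:
        return 0
    # Drop anything after the first whitespace (for example an extended
    # SMILES coordinate block), then count the dot-separated components.
    core = parts[0]
    return core.count(".") + 1


for smiles in ["CCO", "[Na+].[Cl-]", "CCO |(7.26,,;5.72,,;4.18,,)|"]:
    fragments = count_smiles_fragments(smiles)
    label = "multi-fragment" if fragments > 1 else "single fragment"
    print(smiles, fragments, label)
```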


 


See more info here:


http://www.chemaxon.com/jchem/doc/dev/dbconcepts/index.html#calculatedcolumns


http://www.chemaxon.com/marvin/help/chemicalterms/EvaluatorFunctions.html#fragmentcountex


 


Best regards,


Szabolcs

User 52a4e280f0

12-09-2012 14:59:05

 


Hi Szabolcs and Tim,


I performed the steps mentioned by you to find the duplicate structures and then deleted them from the database. I also reloaded the articles which were pointing to those structures assuming that they would point to new original structures upon being loaded again.


But now when we run a search for those structures which were initially considered as duplicates we do not get any result back...


I have attached the mrv file which I used to reload the structures with the new standardization rules, and found that 58 structures were duplicates. Is this correct? Their csids were: (24182,24279,24321,24324,24409,24416,24417,24418,24419,24420,24421,24424,24425,24453,24454,24455,24456,24457,24458,24459,24460,24686,24688,24706,25418,25491,25492,25493,26272,27342,27621,27624,28960,28962,29866,32286,32950,32953,32956,32959,32961,32964,32966,32969,32971,32974,32976,32979,32981,32984,32986,32995,33421,33426,33489,33635,33640,33666)


I am afraid I may be doing something wrong here. Can you please look into this?


Thanks,


Isha



 

ChemAxon a3d59b832c

12-09-2012 15:05:53

Hi Isha,


 


The attachment did not go through. Could you try to attach again?


I have checked the Standardizer configuration, that looks OK.


 


Szabolcs

ChemAxon fa971619eb

12-09-2012 15:08:25

Hi Isha


How did you delete the structures? Using SQL?


This should really be done using the JChem API. Otherwise the contents of the structure cache will be inconsistent, and you will need to reload the structure cache (restart whatever program or server process you are using to do the searching).


But I don't think that would explain why you got no hits. You presumably left one of the duplicates in the database in each case :-)
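
To make the "leave one of the duplicates" point concrete, here is a small sketch that works out which cd_ids to remove from each group of duplicates. Keeping the lowest cd_id in each group is an assumption made for illustration, and the function name is hypothetical. The sketch only computes the list; the deletion itself should still go through JChem rather than plain SQL.

```python
from typing import Iterable, List


def ids_to_delete(duplicate_groups: Iterable[Iterable[int]]) -> List[int]:
    """Return every cd_id except the lowest one in each duplicate group."""
    doomed: List[int] = []
    for group in duplicate_groups:
        ordered = sorted(group)
        # Keep ordered[0] (assumed to be the original); drop the rest.
        doomed.extend(ordered[1:])
    return sorted(doomed)


# Invented groups of cd_ids for illustration.
to_remove = ids_to_delete([[3, 1], [7, 5, 6]])
print(to_remove)
```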


Tim

User 52a4e280f0

12-09-2012 15:10:02

Hi Tim,


I deleted these structures using JChem manager.


I haven't restarted the search application though after deleting them...


Thanks,


Isha


 

ChemAxon d9cc14700b

25-09-2012 14:26:26

Hi Isha,


Have you tried to run your search after restarting the application so that the cache is updated, as Tim suggested?


Best Regards,
Gabor

User 52a4e280f0

25-09-2012 14:52:39

Hi Gabor,


I think the problem was related to what I have reported in: 


https://www.chemaxon.com/forum/viewpost44390.html


There are a few structures which are treated as new every time they are loaded, so each load gives them a new cd_id.


So, I still await the fix for that problem so that I can continue my investigation here...


Thanks,


Isha