Duplicates removal

User 204415f4a4

06-03-2006 17:22:38

Dear All,





Is there any way to remove duplicates from an SD file from the command line?


Also, have you implemented any code that uses InChI for duplicate elimination?


Thanks.





Best regards,


IsI

ChemAxon 9c0afc9aaf

07-03-2006 12:15:11

Hi,
Quote:
Is there any way to remove duplicates from an SD file from the command line?
Yes, but you will need a database connection for this.


JChemManager (jcman) can filter duplicate structures during import.


So importing and then exporting the structures seems to be a straightforward way to get a filtered SDF.


The only problem with the import / export cycle is that SDF data fields are stored in DB fields, so if you do not want to lose data, every SDF field must have an equivalent DB field (you have to take care of this at table creation).





Fortunately, during import jcman can also print either the duplicate or the non-duplicate structures (with their data) to standard output, which gives a more general solution.





1. We need an empty table, so let's create one named "temptable":


Code:



jcman c temptable






2. Now import with filtering (--nodup) and write the unique structures into a new file:





Code:
jcman a temptable input.sdf --nodup --nonduplicates > uniques.sdf






You can also repeat this step for multiple SD files; the structures will be unique across all of them, because previously imported structures are stored in the DB.


(Just make sure to use a different output filename, or append with ">>" instead of ">".)
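
If you have many input files, the repetition can be scripted. Below is a minimal Python sketch, not an official tool: it simply runs the step 2 command once per file and appends the output, assuming jcman is on your PATH and needs no further connection options. The function names are made up for illustration.

```python
import subprocess


def jcman_import_command(table, sdf_path):
    """Build the step 2 import command for one input file."""
    return ["jcman", "a", table, str(sdf_path), "--nodup", "--nonduplicates"]


def collect_uniques(table, sdf_paths, out_path):
    """Import each file in turn and append the printed structures to out_path."""
    # Append mode plays the role of ">>", so earlier output is never overwritten.
    with open(out_path, "ab") as out:
        for sdf_path in sdf_paths:
            subprocess.run(
                jcman_import_command(table, sdf_path),
                stdout=out,
                check=True,  # stop at the first failing import
            )
```

For example, collect_uniques("temptable", ["first.sdf", "second.sdf"], "uniques.sdf") would run the import twice against the same table.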





3. If you no longer need the table, you can simply delete it:





Code:
jcman d temptable
Quote:
Also, have you implemented any code that uses InChI for duplicate elimination?
Yes, we have recently added InChI import / export support to our software.


So you may generate InChI and use it in your algorithm for duplicate filtering.


I still recommend JChem's duplicate filtering for the most precise result.


(It uses a hash code plus a graph search.)
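
If you do want to try the InChI route, here is a rough Python sketch of the filtering part only. It assumes the InChI strings have already been generated and stored in each record as an SD data field; the field name "InChI" and all function names are my assumptions. It treats two records as duplicates when the strings are equal, which may not match JChem's duplicate search in every case.

```python
def split_sd_records(text):
    """Split SD file text into records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":
            records.append("\n".join(current) + "\n")
            current = []
    # Keep a trailing record even if its '$$$$' terminator is missing.
    if any(line.strip() for line in current):
        records.append("\n".join(current) + "\n")
    return records


def inchi_field(record, field="InChI"):
    """Return the value of the '> <InChI>' data field, or None if absent."""
    lines = record.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(">") and "<" + field + ">" in line:
            return lines[i + 1].strip() if i + 1 < len(lines) else None
    return None


def unique_by_inchi(text):
    """Keep the first record for each InChI string, in input order."""
    seen = set()
    kept = []
    for record in split_sd_records(text):
        key = inchi_field(record)
        # Records without an InChI cannot be compared, so they are kept.
        if key is None or key not in seen:
            kept.append(record)
        if key is not None:
            seen.add(key)
    return "".join(kept)
```

Reading the input file into a string and writing the return value of unique_by_inchi to a new file gives the filtered SDF.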





Best regards,





Szilard