how to filter duplicate items - ChemAxon Forum Archive

User 511751460d

29-11-2011 07:27:00

Dears:

how to filter duplicate items

for example

I want to filter duplicate items by structure、 Cas Reg No.、chemical name , or other items

how to carry out it?

can you add the filter duplicate function in the right-click?

when I click the filter duplicate，all the same items are shown to me。I can selectively choose to delete or modify there manually。

ChemAxon fa971619eb

30-11-2011 08:50:28

You can do duplicate filtering at the the structure level. To do this go to the schema editor and look at the properties of your structure entity and check the duplicate filtering option. Once this option is saved no more duplicates can be added to. Trying to add a duplicate will fail. This option can also be set when you create the table/entity so preventing any duplicates. This option works at the structure level. So if you try to enter the same structure in two different formats it will still be considered a duplicate.

To prevent duplicates of text fields like CAS number then you can create a unique index on the column in the database. Again this can be done in the schema editor. Find the field, right click on it and choose 'New index' and then make sure that you check the 'Unique keys' option.

Finally, if your entity/table already has duplicate structures then you can use the Overlap analysis function to identify them: http://www.chemaxon.com/instantjchem/ijc_latest/docs/user/help/htmlfiles/chemistry_functions/performing_overlap_analysis.html

TIm

User 511751460d

05-12-2011 13:15:22

thanks .

User 6ba4d9dd23

16-12-2011 19:51:42

Hi Tim,

I have the similar question to remove the duplicate item. I can not find the duplicate filtering option to check at eh properties of structure entity

ChemAxon 2bdd02d1e5

19-12-2011 07:55:04

You can find it in Data trees editor. See the image...

Filip

ChemAxon fa971619eb

19-12-2011 08:58:08

Yes, its a property of the entity (and also shown in the data tree editor). It is not a property of the structure field.

Tim

User 511751460d

02-03-2012 13:25:10

tdudgeon wrote:

You can do duplicate filtering at the the structure level. To do this go to the schema editor and look at the properties of your structure entity and check the duplicate filtering option. Once this option is saved no more duplicates can be added to. Trying to add a duplicate will fail. This option can also be set when you create the table/entity so preventing any duplicates. This option works at the structure level. So if you try to enter the same structure in two different formats it will still be considered a duplicate.

To prevent duplicates of text fields like CAS number then you can create a unique index on the column in the database. Again this can be done in the schema editor. Find the field, right click on it and choose 'New index' and then make sure that you check the 'Unique keys' option.

Finally, if your entity/table already has duplicate structures then you can use the Overlap analysis function to identify them: http://www.chemaxon.com/instantjchem/ijc_latest/docs/user/help/htmlfiles/chemistry_functions/performing_overlap_analysis.html

TIm

Hi, other questions

how to filte the duplicate text fields which are alread inputed into jchem.

how to delet empty text fields?

thanks

ChemAxon 2bdd02d1e5

05-03-2012 18:50:42

There is no simple way how to filter duplicate text items directly within the IJC. However there are various ways how to find duplicate items. Depends on what you want to do with duplicates(ignore multiple occurences automatically except the first or just identified them?). If you are using MySQL or Oracle DB the simplest way is to find them by using SQL command:
i.e. select field,count(FIELD) from TABLE group by FIELD having count(FIELD) > 1 and it will print you rows which have duplicates. Or it can be implemented by a Groovy script from IJC, for which you need some experiences with it.

Empty text fields can be found by running query "is null" on given field. Then these can be deleted by selecting and deleting multiple rows. Is that what you want to do?

Filip

User 511751460d

06-03-2012 05:30:47

Dear Filip :

Thanks for your reply.

Yes ,According to your instructions I filted the empty text fields easily.

But ,I think it is very useful to jchem to add find empty structurse and duplicate text fields or strcutures function .As chemfinder,it has the function of find empty structures duplicates by unique structures only or clustered list.

The current version of ijc，it is very difficult to find the empty structures and deplicate structures or text fields from thousants of items which are already exieted in a data tree.

For example，some compounds have structure but have deplicate catalog number ，some compounds have no structure but have other information. I want to find out these and modify 。

Thanks very much。

ChemAxon 2bdd02d1e5

13-03-2012 13:09:06

Hello Lipan,

I'm sorry for the deleyed response, I was not available last week. I was thinking about other possibilities for you and have found solution to some problems.

Empty structures can be easily found when searching using Chemical Terms. In query mode right click on the structure field and chose Chem Terms... as the expression write "(atomCount <1 )", leave the structure field empty and proceed with search. All rows with empty structures are retrieved.

Duplicate structures can be searched by Overlap Analysis (Chemistry -> Overlap Analysis). Set the same desired table as the Query table and as the Target table. Duplicate search mode will add the fields Overlap count and Overlap hits to the table. Then simply search on Overlap count field > 0. You can see ID of duplicates in Overlap hits field.

Searching duplicate text items is only possible via direct access to the database or by writing small script for it(would it be feasible for you?).

Thanks for your feedback.

Filip

User 511751460d

17-03-2012 08:19:08

Hi Filip

Thanks very much.

According to your introduction：

I can find empty structures useing mol weight <=1 or formula in null.But there is no right click list on the structure field.

I can use overlap analysis to find out duplicate structures,but still don't know how to find duplicate text.

Best regards,

ChemAxon e189db4705

23-03-2012 17:08:41

Hi,

the attached scripts can be used for finding duplicates in text or numeric fields. They can be used the following way:

Create a new script under datatree where you want to filter duplicates (right click to datatree node and choose New script)

Open one of the attached script and paste the text to editor

Save the script and execute it

Open gridview

Go to Lists and Queries panel

There should be a new temporary list called e.g. "Duplicates in Formula field (314)". Double click to this list and it will be applied to your current results

Sort by that field where you are looking duplicates

Now you should see only rows where the value of the field (e.g. Formula) exist in more occurrences in the table. And the table is sorted by this field, so you see the duplicated values grouped.

There are two scripts attached. The simple one requires you specify field name inside the script. In example it's set to "Formula". The other one uses dialog chooser so it allows you to select a field where you are looking for the duplicates.

Let us know, please, if it works for you.

Petr

User 247c00dc1d

26-03-2012 17:09:10

phamernik wrote:

Hi,

the attached scripts can be used for finding duplicates in text or numeric fields. They can be used the following way:

Create a new script under datatree where you want to filter duplicates (right click to datatree node and choose New script)

Open one of the attached script and paste the text to editor

Save the script and execute it

Open gridview

Go to Lists and Queries panel

There should be a new temporary list called e.g. "Duplicates in Formula field (314)". Double click to this list and it will be applied to your current results

Sort by that field where you are looking duplicates

Now you should see only rows where the value of the field (e.g. Formula) exist in more occurrences in the table. And the table is sorted by this field, so you see the duplicated values grouped.

There are two scripts attached. The simple one requires you specify field name inside the script. In example it's set to "Formula". The other one uses dialog chooser so it allows you to select a field where you are looking for the duplicates.

Let us know, please, if it works for you.

Petr

Thanks for a great script!
but I have a problem - it finds duplicates in all fields except the "structure"... but error does not give... what is wrong?

Thanks!

ChemAxon 2bdd02d1e5

26-03-2012 17:45:03

It is not designated for searching duplicates among structures. There is overlap analysis for it. See http://www.chemaxon.com/instantjchem/ijc_latest/docs/user/help/htmlfiles/chemistry_functions/performing_overlap_analysis.html

Which error do you mean? I got this when trying to search structure field:
ERROR X0X67: Columns of type 'BLOB' may not be used in CREATE INDEX, ORDER BY, GROUP BY, UNION, INTERSECT, EXCEPT or DISTINCT statements because comparisons are not supported for that type.

Filip

User 247c00dc1d

26-03-2012 18:39:26

fzimandl wrote:

It is not designated for searching duplicates among structures. There is overlap analysis for it. See http://www.chemaxon.com/instantjchem/ijc_latest/docs/user/help/htmlfiles/chemistry_functions/performing_overlap_analysis.html

Which error do you mean? I got this when trying to search structure field:
ERROR X0X67: Columns of type 'BLOB' may not be used in CREATE INDEX, ORDER BY, GROUP BY, UNION, INTERSECT, EXCEPT or DISTINCT statements because comparisons are not supported for that type.

Filip

I do not obtain a error message, when I choose structure - it just return 0 duplicates, but when I looking for duplicates by other fields, as example by formula or by molar mass it is working clearly.

I'm using first script with choosing of field ability .

Igor.

User 511751460d

29-05-2012 02:16:22

tdudgeon wrote:

You can do duplicate filtering at the the structure level. To do this go to the schema editor and look at the properties of your structure entity and check the duplicate filtering option. Once this option is saved no more duplicates can be added to. Trying to add a duplicate will fail. This option can also be set when you create the table/entity so preventing any duplicates. This option works at the structure level. So if you try to enter the same structure in two different formats it will still be considered a duplicate.

To prevent duplicates of text fields like CAS number then you can create a unique index on the column in the database. Again this can be done in the schema editor. Find the field, right click on it and choose 'New index' and then make sure that you check the 'Unique keys' option.

Finally, if your entity/table already has duplicate structures then you can use the Overlap analysis function to identify them: http://www.chemaxon.com/instantjchem/ijc_latest/docs/user/help/htmlfiles/chemistry_functions/performing_overlap_analysis.html

TIm

Dear

another qustion

I creaded a unique index(eg:cas no),when I inport a sdf; all of the compounds which have no cas number can not been imported.

ChemAxon 2bdd02d1e5

29-05-2012 10:14:15

Only unique values are allowed on indexed column when index has unique constraint. You can not import the rows with blank CAS NOs because there is already one row with no CAS NO present (first entry in the SD file which has no CAS NO).

If you need to import such file don't thick Unique keys checkbox when creating index.

Best regards,

Filip

User 511751460d

29-05-2012 13:37:52

Thanks very much.

Are the any ways to import such file when created a unique values?

Because I want to filter out the duplicated cas no,but import these compounds with blank CAS NOs.

ChemAxon 2bdd02d1e5

29-05-2012 16:50:03

Then I would recommend to go to Database Tables view and delete the index with Unique keys. You can recreate the index without Unique keys contraint just after. Then you should be able to import rows with duplicate values if it is not restricted on the column.

You can still find the duplicates when using the scripts above.

Does it help in your case?

You are very welcome.
Filip

User 511751460d

01-06-2012 14:09:31

Although a little trouble, but this method is very effective.

Thanks very much.