Delete Molecules that are represented x times in DB

User dfeb81947d

13-05-2005 14:17:10

Dear Support,





I'd like to know if there is a fast possibility to find and erased molecules that are represented twice or more in the table, in order to keep a single representation of the molecule.


If I make a perfect JChemSearch on the whole table and delete all the hits. It means that I have to make as many search as there is molecules.


And With deleting a molecule from the structure search, it makes the cache to be load again.


So, I desactivated the use of cache... But each screen are long...


And for a database of 2 or 3 millions structures, it means that the whole table will be cleaned in several months.





Is there a fast way to find and erase structure that are represented twice (or more) in the database?





Actually I do like that:





Code:



//searcher is the JChemSearch object


// id is the id of the query


//pstmt is a PreparedStatement that update the id of the hit on an other table with the id of the query


// query = UPDATE MY_TABLE SET ID = ? WHERE ID = ?


pstmt = conn.PreparedStatement(query);


searcher.setQueryStructure(mol);


searcher.run();


for(i=0;i<searcher.getResultCount();i++) {


   if (id != searcher.getResult(i)) {


      pstmt.setInt(1, id);


      pstmt.setInt(2, searcher.getResult(i));


      pstmt.execute();


      UpdateHandler.deleteRows(ch, searchTable, "WHERE CD_ID="+String.valueOf(searcher.getResult(i));


      commit();


   }


}








Thank you very much for your help.


Kind regards.





Jacques

ChemAxon 9c0afc9aaf

13-05-2005 18:58:03

Hi Jacques,





Maybe it would be a much simpler solution to import the structures again with duplicate filtering ... is it a viable option for you ?





Some quick comments on the code:





- I think this code part runs for every structure. In that case you do not have to create a new PreparedStatement for every structure. (just create one and close it at the very end)





- I don't know what your commit() method does, but probably commits the connection. Without commit the process may be faster.





- the following code part seems to be overcomplicated:





Code:
"WHERE CD_ID="+String.valueOf(searcher.getResult(i))






this should be a simpler solution:





Code:
"WHERE CD_ID="+searcher.getResult(i)









In any case PERFECT search will not use the structure cache, so it should not make any difference if you use it or not.





If you use Oracle, increasing the buffer cache may help to speed up the process.





In version 3.0 and above the prefect search is faster than in previous versions, if you are still using 2.3.4 you may consider an upgrade.





Best regards,





Szilard

User dfeb81947d

16-05-2005 07:53:15

Hi,





Thank you for the reply.


The problem is, that filtering the structure table alone is not enough.


In fact there is an other table which contains cd_id and other information.


When I find duplicate structure, I delete them in the structure table and update the cd_id in the other table.
Quote:



STRUCTURE


cd_id


cd_structure


...





and





DATAS


cd_id


data1


data2


...











where STRUCTURE.cd_id = DATAS.cd_id


example:


the molecule with cd_id = 1 has two duplicate: cd_id = 2 and cd_id = 3.


So I delete in structure table the duplicate and update the other table:


Code:
update DATAS set cd_id = 1 where cd_id = 2;


update DATAS set cd_id = 1 where cd_id = 3;


commit; (here I commit to validate the update)





because of the table DATAS, I can't remove duplicate using JChemManager importation.
Quote:
I think this code part runs for every structure. In that case you do not have to create a new PreparedStatement for every structure
yes of course, it was just a sample to explain that I don't only re


move duplicate but also manage the information of an other table.
Quote:
I don't know what your commit() method does, but probably commits the connection. Without commit the process may be faster.


I commit the connection to validate the update since my connection is on autoCommit(false).
Quote:
In version 3.0 and above the prefect search is faster than in previous versions, if you are still using 2.3.4 you may consider an upgrade.
for this structure table I'm using JChem 3.0.11





Thank you for your help,





Kind regards,





Jacques

ChemAxon a3d59b832c

16-05-2005 08:08:38

Jacques,





It is possible to efficiently pre-filter your table. For duplicate structures the cd_hash column is similar. (However, they could be still different structures, so another perfect search is needed for the final check.)





You can use the following query to collect the possible duplicates:





Code:
select A.cd_id, B.cd_id from structures A, structures B where A.cd_hash = B.cd_hash AND A.cd_id != B.cd_id;






"structures" is the name of the table. I recommend to do the checkup perfect searches in one batch without deleting. This will be fast because the cache will not be reloaded again.





Then, when you have the list of all duplicates, do the deletion and update.





All the best,


Szabolcs

User dfeb81947d

16-05-2005 13:10:30

It's a really good idea, thank you so far ^^.


Is it faster to compare one by one the possible duplicate from the filter, with MolSearch?


Or to query again the whole database with JChemSearch?





Thank you for your help.

ChemAxon a3d59b832c

16-05-2005 13:24:35

Jacques wrote:
Is it faster to compare one by one the possible duplicate from the filter, with MolSearch?


Or to query again the whole database with JChemSearch?
I think it should not make much difference.





Szabolcs

ChemAxon 9c0afc9aaf

16-05-2005 14:14:21

Hi,





Please see this benchmark about import performance:





http://www.jchem.com/FAQ.html#benchmark2





During the import with duplicate filtering a PERFECT search is run for every single input molecule.





As you see the time does not increase exponentially (as one would expect), because the screening with the cd_hash values is very fast and almost always returns only the identical structures for the subsequent graph search algorithm.





Therefore the search time itself is roughly a linear function of the number of searches and the number of duplicates found.


You may check the speed with a pilot project only searching duplicates without deleting them.





As I have mentioned, the cache is not used at all by the PERFECT search mode (so it makes no difference if you turn it off or not), however Szabolcs is right, that other users may search the database while your code is running, so the cache may be constantly refreshed by them.





If you can run your application at some idle time (night , weekend) when no other searches are run, then this should not be a problem.





If you cannot, you have 2 choices for deleting the structures:





A. Delete the rows yourself (without UpdateHandler). This will not register changes in the structure table, so the cache will not be refreshed during the process.


During the process the user's searches may find structures that no longer exist in the database. Our code is ready for such situations (e.g. the JSP example shows an empty structure with the name "DELETED"), so probably it's not a real problem if your custom code can also handle such cases.


After deleting the lines you can trigger a cache update without restarting the server with a dummy delete for example:





UpdateHandler.deleteRows(connHandler, "mytable", "WHERE CD_ID=-1");





(there is no such cd_id as -1)





B. First collect the duplicates, and then delete them en masse.











Regarding speed:





Using MolSearch could be faster than JChemSearch, because JChemSearch wouldn't have to calculate the hash code again.


The problem is that in certain cases the cd_smiles field may be NULL, so you would have to fetch the cd_structure, standardize the target, etc.





The good news is that you can also bring JChemSearch to the same speed by fetching the cd_hash together with the query structure from the table, and passing it to JChemSearch with this method before the search:





void setPreCalculatedHashCode(int hashCode)





This has package friendly access at the moment (not public), so you have to create a "helper" class under the package "chemaxon.jchem.db" that has the right to call this method.





Best regards,





Szilard

User dfeb81947d

20-05-2005 10:44:43

Thank you very much, it works very well and efficiently with a first filter on cd_hash and than compare mol to mol with MolSearch.





Thank you for everything everybody





Kind Regards


Jacques