Assistance with a duplicate check query

User 0908c5ccdd

20-11-2007 12:45:43

I wondered if you could advise me on a query I need to run please? I have two tables containing different sets of structures. Table A contains almost 1 million structures and Table B contains approx 80,000 structures. I need to establish the alternate ids of structures that exist in both tables, using an exact search. The structure ids (cd_id) are different in both tables, although each has its own alternate identifier. I was hoping to run the query in a single statement, rather than through a PL/SQL procedure involving an explicit cursor.





Something like:





Code:



SELECT a.altid_a, b.altid_b


FROM table1 a, table 2 b


WHERE a.cd_smiles = b.cd_smiles






Although I don't think I can use a.cd_smiles = b.cd_smiles


Is there an alternative using a ChemAxon function on the cd_smiles column that might be suitable to use in the WHERE clause?

ChemAxon aa7c50abf8

21-11-2007 07:49:38

I assume you use JChem Cartridge with JChem structure tables.





I also assume that the tables contain plain targets without query features. Then exact search is equivalent to perfect search and we don't have to worry about which one of the two structures being compared is the query.





Under such circumstances, the supported way to do this would be





Code:
SELECT a.altid_a, b.altid_b


FROM table1 a, table2 b


WHERE jc_compare(a.cd_smiles, b.cd_smiles, 't:p') = 1;
.





To make the query somewhat faster you could do a pre-filter using the cd_hash column which is not part of the public interface (at least not in JChem Cartridge):





Code:
select asmi, bsmi


from (SELECT a.altid_a asmi, b.altid_b bsmi


      FROM table1 a, table2 b


      WHERE  a.cd_hash = b.cd_hash) hashequals


where jc_compare(asmi, bsmi, 't:p') = 1;






A similar pre-filtering on the cd_hash column is actually performed in the first statement 80 000 times with a literal value (the hash value of a structure in B each time) on the left side of the equality condition. A performance gain (if any) may come from the fact that the hash-based filtering may be more efficient when done in one sweep as a join.