CD_HASH use

ChemAxon 60ee1f1328

07-01-2007 13:05:41

I need some advice please regarding the use of CD_HASH.





Firstly, is CD_HASH available in the Cartridge API? (I don't think so.) And if not, is it likely to be added to the Cartridge API sometime soon?





Obviously, I might wish to use CD_HASH as a filter to reduce the amount of data joined against in a target table. However, it is not clear to me how much data this would actually filter out relative to, say, using CD_FORMULA and CD_MOLWEIGHT combined as a pre-duplicate-check filter (which will of course include all stereoisomers for a given query). Is the CD_HASH number sufficiently accurate to eliminate stereoisomers in a pre-filter join step? Perhaps a certain level of recursion will give a certain probability of this?





Cheers,


Daniel.

ChemAxon aa7c50abf8

07-01-2007 20:44:08

Hi Daniel,





What is a "pre-filter step join"?





Thanks


Peter

ChemAxon 9c0afc9aaf

07-01-2007 21:06:48

Quote:
Is CD_HASH number sufficiently accurate to eliminate stereoisomers in a pre-filter step join?
No stereo information is used during hash code generation.





The cd_hash column is mainly intended for internal use, to speed up duplicate filtering. Structures with identical hash codes have a high probability of being identical, but there is also a chance that different structures receive the same hash code.


Therefore JChem always runs a graph search before reaching the final verdict.
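
For illustration only: if both sides of a duplicate check are JChem structure tables (so each already carries a cd_hash column), the hash can serve as a coarse join key ahead of an exact check, roughly as in the sketch below. STAGE and TARGET are hypothetical table names, and whether this actually outperforms JChem's built-in duplicate filtering would need to be measured.

Code:

-- Hedged sketch: use cd_hash only to narrow candidate pairs,
-- then confirm with an exact structure comparison.
SELECT s.cd_id AS stage_id, t.cd_id AS target_id
FROM   STAGE s, TARGET t
WHERE  s.cd_hash = t.cd_hash                        -- coarse filter: possible duplicates
  AND  jc_equals(t.cd_smiles, s.cd_smiles) = 1;     -- exact check resolves hash collisions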





I'm also not sure exactly what you want to achieve, but:


- We usually recommend filtering out duplicates during the import via the standard tools provided by JChem


- If you also need the duplicates (because of associated data, etc.), you could store each structure only once (without duplicates) in a structure table and refer to that table via a many-to-one relation
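
A minimal sketch of such a many-to-one layout, assuming a JChem structure table named structures with the standard cd_id key column (all other names here are hypothetical):

Code:

-- Structures are stored once, without duplicates, in the JChem structure table.
-- Supplier/lot data lives in a separate table that points back at the unique
-- structure via a foreign key (many data rows per structure).
CREATE TABLE compound_data (
    data_id      NUMBER PRIMARY KEY,
    structure_id NUMBER NOT NULL,
    supplier_no  VARCHAR2(50),
    quantity     NUMBER,
    price        NUMBER,
    CONSTRAINT fk_compound_structure
        FOREIGN KEY (structure_id) REFERENCES structures (cd_id)
);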





Best regards,





Szilard

ChemAxon 60ee1f1328

08-01-2007 14:47:46

A "pre-filter step join" is some meaningless jargon that I made up in a hurry, I will try and be clear about the questionsI pose in future but in this case I slipped!





We currently create a smaller target set (relative to the single-view set) to join against for duplicate filtering of a stage set that has been imported, salt-stripped and re-standardized. We use molweight and formula to create this target set, but it could very well include stereoisomeric forms, which is why we asked whether stereoisomer information is contained in cd_hash. That was a bit much to ask in the end, I think, so we can leave it there. This leaves us with what looks like a worst-case scenario for a join: the Chembridge file of ~0.5 million structures versus ~1 million or so identified as possible target molecules. We don't think this join will complete in a short time, so we are looking to manually parallelise our build, i.e. split the input into chunks...
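
For illustration only, that kind of molweight/formula pre-filter could be expressed along the lines of the sketch below (the table names are placeholders rather than our actual schema, cd_formula and cd_molweight are the standard JChem columns, and exact equality on the molweight float may need rounding or a tolerance):

Code:

-- Build a smaller target set: keep only target rows whose formula and
-- (rounded) molecular weight also occur in the incoming stage set.
CREATE TABLE target_subset AS
SELECT t.*
FROM   full_target t
WHERE  EXISTS (SELECT 1
               FROM   stage s
               WHERE  s.cd_formula = t.cd_formula
               AND    ROUND(s.cd_molweight, 2) = ROUND(t.cd_molweight, 2));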





That is, have you seen examples of a query like the one below completing


when t4 contains ~1 million rows and t3 ~0.5 million? Perhaps with infinite memory and time it may be OK? We are looking to expand our internal hardware, so more memory may become available, but it would be nice to know whether you consider this query unreasonable in your experience.





Code:

SELECT DISTINCT t3.smiles, t3.cleansmiles, t3.supplier_no, t3.original_molweight,
       t3.keepone_molweight, t3.logp, t3.quantity, t3.purity, t3.price,
       t3.tpsa, t3.meltingpoint, t3.solubility
FROM   QUERY t3, TARGET t4
WHERE  t3.cleansmiles IS NOT NULL
  AND (jc_equals(t4.cd_smiles, t3.cleansmiles) = 1
       OR jc_compare(t4.cd_smiles, t3.cleansmiles,
            't:e exactChargeMatchingOption:e doubleBondStereo:A HCountMatching:E exactStereoMatching:y absoluteStereo:a stereoSearch:y') = 1);








Thanks for your help,


Daniel.

ChemAxon aa7c50abf8

09-01-2007 11:09:31

The first step is to check the query plan to see how the search operators are used. Make sure that they are evaluated through domain index scans. If they are not evaluated that way with your original statement, try to reformulate it, e.g. by moving them out into subselects (one search operator per subselect). A domain index scan is their most efficient mode of operation.
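
For illustration, one possible reformulation along these lines is sketched below, using your tables and option string with one search operator per correlated subselect. I have not verified that the optimizer picks domain index scans for this form, so the plan still needs checking.

Code:

-- One search operator per subselect, so that each can potentially be
-- evaluated through its own domain index scan (verify in the plan).
SELECT DISTINCT t3.smiles, t3.cleansmiles, t3.supplier_no, t3.original_molweight,
       t3.keepone_molweight, t3.logp, t3.quantity, t3.purity, t3.price,
       t3.tpsa, t3.meltingpoint, t3.solubility
FROM   QUERY t3
WHERE  t3.cleansmiles IS NOT NULL
  AND (EXISTS (SELECT 1 FROM TARGET t4
               WHERE jc_equals(t4.cd_smiles, t3.cleansmiles) = 1)
    OR EXISTS (SELECT 1 FROM TARGET t4
               WHERE jc_compare(t4.cd_smiles, t3.cleansmiles,
                     't:e exactChargeMatchingOption:e doubleBondStereo:A HCountMatching:E exactStereoMatching:y absoluteStereo:a stereoSearch:y') = 1));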





The next step is to execute a few sample "stand-alone" structure searches (such as you expect to be typical in the join) and measure their speed. You can then roughly estimate the time required for the whole join.
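
For example, a single representative search could be timed roughly as in the sketch below (the SMILES string is just a placeholder; SET TIMING ON is the SQL*Plus timing switch):

Code:

-- Time one stand-alone exact search against TARGET, then multiply by the
-- number of query structures (~0.5 million) for a rough estimate of the join.
SET TIMING ON
SELECT COUNT(*)
FROM   TARGET t4
WHERE  jc_equals(t4.cd_smiles, 'CCO') = 1;   -- 'CCO' is a placeholder query structure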





Cheers,


Peter