ChemAxon 60ee1f1328
07-01-2007 13:05:41
I need some advice please regarding the use of CD_HASH.
Firstly, is it available in the Cartridge API (I don't think so), and is it likely to become available there any time soon?
I might wish to use CD_HASH as a filter to reduce the data I join to in a target table. However, it is not clear to me how much data it would actually filter out compared with using CD_FORMULA and CD_MOLWEIGHT together as a filter before the duplicate check (which will of course let through all stereoisomers of a given query). Is the CD_HASH value sufficiently accurate to eliminate stereoisomers in a pre-filter step join? Perhaps a certain level of recursion will give a certain probability of this?
Cheers,
Daniel.
ChemAxon aa7c50abf8
07-01-2007 20:44:08
Hi Daniel,
What is a "pre-filter step join"?
Thanks
Peter
ChemAxon 60ee1f1328
08-01-2007 14:47:46
A "pre-filter step join" is some meaningless jargon that I made up in a hurry. I will try to be clearer about the questions I pose in future, but in this case I slipped!
We now create a smaller target set (relative to the single view set) to join against when duplicate-filtering a stage set that has been imported, salt-stripped and re-standardized. We use molweight and formula to create this target set, but it could very well include stereoisomeric forms, which is why we asked whether stereoisomer information is contained in cd_hash. That is a bit much to ask in the end, I think, so we can leave it here. As far as we can see, this leaves us with a worst-case join of the Chembridge file of ~0.5 million structures against ~1 million or so identified as possible target molecules. We don't think this join will complete in a short time, so we are looking to manually parallelise our build, i.e. split the input into chunks...
Have you seen examples of a query like the one below completing when t4 ~ 1 million and t3 ~ 0.5 million rows? Perhaps with infinite memory and time it may be OK? We are looking to expand our internal hardware, so more memory may become available, but it would be nice to know whether you consider this query unreasonable in your experience.
Code: |
SELECT DISTINCT
    t3.smiles, t3.cleansmiles, t3.supplier_no,
    t3.original_molweight, t3.keepone_molweight, t3.logp,
    t3.quantity, t3.purity, t3.price,
    t3.tpsa, t3.meltingpoint, t3.solubility
FROM QUERY t3, TARGET t4
WHERE t3.cleansmiles IS NOT NULL
  AND (jc_equals (t4.cd_smiles, t3.cleansmiles) = 1
       OR jc_compare (t4.cd_smiles, t3.cleansmiles, 't:e exactChargeMatchingOption:e doubleBondStereo:A HCountMatching:E exactStereoMatching:y absoluteStereo:a stereoSearch:y') = 1);
|
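The manual chunking mentioned above could look roughly like the following. This is only a minimal Python sketch: the `chunked` helper, the chunk size of 50,000 and the integer ID range standing in for the ~0.5 million QUERY rows are all hypothetical, and nothing here is part of the JChem Cartridge API.

```python
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` items taken from `rows`."""
    if size <= 0:
        raise ValueError("size must be positive")
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Hypothetical stand-in for the primary keys of the ~0.5 million QUERY rows.
query_ids = range(1, 500_001)

# Assumed chunk size; each chunk would drive one independent join run.
chunks = list(chunked(query_ids, 50_000))

for number, chunk in enumerate(chunks, start=1):
    # Each (first, last) ID pair could bound one run of the duplicate check.
    print(f"chunk {number}: ids {chunk[0]}..{chunk[-1]}")
```

Each chunk's ID range could then restrict t3 in a separate run of the join, so that the runs can be executed side by side.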
Thanks for your help,
Daniel.
ChemAxon aa7c50abf8
09-01-2007 11:09:31
The first step is to check the query plan to see how the search operators are used. Make sure that they are evaluated through domain index scans. If your original statement does not evaluate them that way, try to reformulate it, e.g. by moving them out into subselects (one search operator per subselect). A domain index scan is the most efficient way for these operators to work.
Next step is to execute a few sample "stand-alone" structure searches (such as you expect to be typical in the join) and measure their speed. You can then roughly calculate the time required for the join.
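The rough calculation can be sketched as follows. This is a minimal Python sketch that assumes each row of the query table costs about one stand-alone search and that the work divides evenly across parallel chunks. The function name `estimate_join_seconds` and the sample timings are hypothetical placeholders, not measurements.

```python
def estimate_join_seconds(sample_seconds, n_query_rows, workers=1):
    """Extrapolate total join time from a few measured stand-alone searches.

    Assumes one search per query row and an even split across `workers`.
    """
    if not sample_seconds:
        raise ValueError("need at least one sample timing")
    if workers <= 0:
        raise ValueError("workers must be positive")
    mean_seconds = sum(sample_seconds) / len(sample_seconds)
    return mean_seconds * n_query_rows / workers


# Hypothetical timings (in seconds) of a few typical stand-alone searches.
samples = [0.04, 0.06, 0.05]

serial = estimate_join_seconds(samples, 500_000)
parallel = estimate_join_seconds(samples, 500_000, workers=4)

print(f"serial:   about {serial / 3600:.1f} hours")
print(f"4 chunks: about {parallel / 3600:.1f} hours")
```

Real timings will vary with the structures searched, so the result is only an order-of-magnitude guide.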
Cheers,
Peter