Multiple jc_compare clauses give JChem error

User f698d0529d

22-08-2005 09:10:15

Hi


If I write the following query, it runs fine.





Code:



SELECT   v.PARENT_ID, v.VERSION_ID, v.ISOSMILES, ROUND(v.MOLECULAR_WEIGHT, 2) MOLECULAR_WEIGHT, v.MOLECULAR_FORMULA,


         NULL Similarity


    FROM VERSION v


   WHERE 


jc_compare(v.ISOSMILES, 'Nc2ccc3ccccc3c2',


'sep=! t:s!filterQuery:select rowid from eoaiacd.version v where v.parent_id = 20251'


) = 1


ORDER BY v.molecular_weight ASC;








However, if I want to add another jc_compare clause for a different substructure search, jchem returns an error





Exception: Illegal option: filterquery





Code:



SELECT   v.PARENT_ID, v.VERSION_ID, v.ISOSMILES, ROUND(v.MOLECULAR_WEIGHT, 2) MOLECULAR_WEIGHT, v.MOLECULAR_FORMULA,


         NULL Similarity


    FROM VERSION v


   WHERE 


jc_compare(v.ISOSMILES, 'Nc2ccc3ccccc3c2',


'sep=! t:s!filterQuery:select rowid from eoaiacd.version v where v.parent_id = 20251'


) = 1


OR 


jc_compare(v.ISOSMILES, 'C1CCCCC1',


'sep=! t:s!filterQuery:select rowid from eoaiacd.version v where v.parent_id = 20251'


) = 1


ORDER BY v.molecular_weight ASC;








I need to be able to write this kind of query because my application supports searching for multiple substructures simultaneously, and the user can specify that a structure either must match or must not match. Before, I was using regular jc_contains syntax. This was really slow, and the reason is because we already had a very filtered subset of rows to look at, using other, non chemical, criteria (in the case above, the parent id). Since this is what jc_compare seems to have been designed for, I thought I would try that.





Thanks for your help.


Mark

ChemAxon aa7c50abf8

22-08-2005 12:33:13

Mark,





You most probably get the error message because the jc_compare operator is invoked as a function. The filterQuery option is pointless in such situations and is therefore intentionally rejected.





Your problem is the result of another issue which we are currently working on: if two jc_idxtype operators are combined with an OR condition in the WHERE clause, the optimizer is willing to use domain index scans only if the row sets resulting from the domain index scans can be combined using a BITMAP OR operation. If the BITMAP OR operation is not available (as with Oracle 10g Standard Edition), the optimizer will always invoke the jc_idxtype operators in functional mode as downstream filters -- typically on top of a full table index scan.





In such cases, the jc_compare operator is applied/invoked once against each structure in the "pipelined" row set. Therefore (a) pre-filtering against the entire table (using the filterQuery option) is not meaningful; (b) specifying conditions with "regular" operators in the WHERE clause of the containing SQL query has no performance disadvantage.





Peter





PS:


Is not possible to enable bitmap indexes with the Standard Edition?

User f698d0529d

22-08-2005 13:23:29

From a conversation with our DBA, it looks like it is not possible to enable bitmap indexes in the standard edition. It is an enterprise only feature.





Also, just to clarify about the problem, you said





"specifying conditions with "regular" operators in the WHERE clause of the containing SQL query has no performance disadvantage. "





I think we have discussed before about the undesirability of having to write SQL differently depending on what the developer knows. However, I do accept there are times when it has to be done. Just for clarification, and probably unecessarily, here is the problem





Code:



--A


SELECT   v.PARENT_ID, v.VERSION_ID, v.ISOSMILES, ROUND(v.MOLECULAR_WEIGHT, 2) MOLECULAR_WEIGHT, v.MOLECULAR_FORMULA,


         NULL Similarity


    FROM VERSION v


   WHERE v.parent_id IN


(20251,28864,479463,7671,479523,11083,250925,5530,58199,230169,243459,242352,248263,50359,632,71713,243233)


         AND jc_contains(v.ISOSMILES, 'Nc1ccc2ccccc2c1') = 1


ORDER BY v.molecular_weight ASC;





--B


SELECT   v.PARENT_ID, v.VERSION_ID, v.ISOSMILES, ROUND(v.MOLECULAR_WEIGHT, 2) MOLECULAR_WEIGHT, v.MOLECULAR_FORMULA,


         NULL Similarity


    FROM VERSION v


   WHERE jc_contains(v.ISOSMILES, 'Nc1ccc2ccccc2c1') = 1


ORDER BY v.molecular_weight ASC;





--C


SELECT   v.PARENT_ID, v.VERSION_ID, v.ISOSMILES, ROUND(v.MOLECULAR_WEIGHT, 2) MOLECULAR_WEIGHT, v.MOLECULAR_FORMULA,


         NULL Similarity


    FROM VERSION v


   WHERE jc_compare(v.ISOSMILES, 'Nc1ccc2ccccc2c1',


'sep=! t:s!filterquery:select rowid from eoaiacd.version v where v.parent_id IN (20251,28864,479463,7671,479523,11083,250925,5530,58199,230169,243459,242352,248263,50359,632,71713,243233)'       


       ) = 1


ORDER BY v.molecular_weight ASC;








I want to execute the query A above. However, although the parent id field is indexed, running query A takes the same length of time as query B. Query C takes about 10% of this time, but is undesirably different from A, and moreover cannot be made work if there is more than one substructure to be specified simultaneously.





I have given the explain plans and times I see below.





A - 6 seconds


Operation Object Name Rows Bytes Cost Object Node In/Out PStart PStop





SELECT STATEMENT Optimizer Mode=ALL_ROWS 1 477.029038274469


SORT ORDER BY 1 70 477.029038274469


TABLE ACCESS BY INDEX ROWID EOAIACD_V15.VERSION 1 70 476.029038274469


DOMAIN INDEX EOAIACD_V15.JC_IDX_VERSION 18


B - 6 seconds


Operation Object Name Rows Bytes Cost Object Node In/Out PStart PStop





SELECT STATEMENT Optimizer Mode=ALL_ROWS 2 K 520.481361335515


SORT ORDER BY 2 K 172 K 520.481361335515


TABLE ACCESS BY INDEX ROWID EOAIACD_V15.VERSION 2 K 172 K 476.029038274469


DOMAIN INDEX EOAIACD_V15.JC_IDX_VERSION





C - 172 ms


Operation Object Name Rows Bytes Cost Object Node In/Out PStart PStop





SELECT STATEMENT Optimizer Mode=ALL_ROWS 2 K 520.481361335515


SORT ORDER BY 2 K 172 K 520.481361335515


TABLE ACCESS BY INDEX ROWID EOAIACD_V15.VERSION 2 K 172 K 476.029038274469


DOMAIN INDEX EOAIACD_V15.JC_IDX_VERSION


Mark

ChemAxon aa7c50abf8

22-08-2005 19:26:52

For the kind of queries in your original posting, I suggest to put jc_idxtype operators in multiple SELECTs and combine the SELECTs with the UNION operator, until the issue with the "missing BITMAP OR operation" is solved.





As to your second posting, I understand that it is a generalization of the problem in your first posting. Here you have the comments I can offer:





With a typical Oracle data cartridge implementation, the solution to the issues you brought up is that the data cartridge implementation interacts with the Oracle optimizer through the extensible optimizer interface. This consists of providing the optimizer with the cost estimations (for the domain index operators) needed to create realistic query plans for the execution of complex SQL queries involving domain index operators. The optimizer is then in a position to choose the truly optimal query plan.





The problem with the current implementation of JChem Cartridge is that the functional invocation of the jc_idxtype operators is by orders of magnitude slower than the index scan invocation. My "unofficial" estimation is that searching a sufficiently large number of structures (in the order of several hundreds or more) takes about 100 times longer through individual function invocations of jc_idxtype operators than through domain index scan.





This means that your preferred statement "A" in your second posting might be indeed a good candidate for "classical" optimization. I consider here two possible query plans: (a) a domain index scan is done using jc_contains to generate the set of rows which will be further filtered based on the parent_ids specified with the IN operator; (b) a index range scan is performed on the parent_id to generate the set of rows whose structures are fed to multiple invocations of the jc_contains operator. Since parent_id is a "simple" data type, picking the rows with matching parent_ids will be presumably very fast. The decisive factor will be the speed of the jc_contains operator.





Based on the sample data you provided, it is likely that with query plan (a), the execution time of the domain index scan (with presumably a significantly large result set) takes long enough to comfortably offset the combined execution time of the few functional invocations needed with query plan (b). If we provided the correct estimation about the times of the two execution modes of the operator (domain index scan vs. functional invocation), the optimizer would presumably choose to do an index range scan on the parent_id column and feed the structures from the matching rows to the jc_contains operator. And doing so would probably result in a significantly better execution time for query "A" (in my very rough estimation about 1-2 seconds) than the current 6 seconds.





However, due to the very long execution time of functional invocations, the potential efficiency of the "classical" optimization for query "A" is dependent on the particular predicates used resulting in a pretty long domain index scan versus only a few functional invocations. In a very large number of cases, the "classical" optimization (extensible optimizer support), though accurate (executing the truly least expensive plan) would be slow. In certain subset of these large number of cases the filterQuery option is more effective than "classical" optimization would be. And even in those cases where "classical" optimization would make a noticeable difference (see the hypothetical reduction of the query time "A" from 6 to 2 seconds), the filterQuery option performs even better.





We are considering ways to speed up the functional invocations in order to significantly increase the number of cases where "classical" optimization would be effective. (Although I am talking about "number of cases" here, obviously these proportions are hard to quantify.) I wonder, for example, how typical the statement "A" in your second posting is. Do you expect to have many statements of this kind which could be (according to the reasoning above) effectively optimized through support for "classical" optimization (as opposed to the use of the filterQuery option)? Or did you "think it up" to exemplify your point?





I am not sure if I added anything to the discussion... (Would less have been more?)





Peter

User f698d0529d

23-08-2005 08:56:03

Thank you for the explanation. Just to emphasise that this kind of SQL is generated routinely by my application, and I did not "think it up". The actual SQL generated is more complicated, but I simplified it to show the central problem.





Incidentally, I see in this forum another posting about multiple SMILES matching, and you mention that it is planned in a future release. Depending on how this will work, it might obviate the need for my kind of queries.





If jc_contains and jc_compare could specify a set of SMILES/SMARTS to match, with a syntax indicating how the match is to be performed, and the match was done all in one index scan, that would be ideal. I don't know if it is possible.





The syntax would need to support AND, OR and NOT (and maybe XOR), with a means of specifying precedence.





For example, the hypothetical syntax





where jc_contains(SMILES, '((c1ccc2ccccc2c1)|(c1ccccc1))!(C1CC[!#6]C1)' = 1





meaning the target must contain napthalene or benzene, but must not contain a heteroatom cyclopentane ring.

ChemAxon aa7c50abf8

24-08-2005 10:53:32

Maybe, something like the following would do the job:





Code:
where jc_compare(structure, '*', 't:s ctFilter:(match("c1ccc2ccccc2c1")||match("c1ccccc1"))&&!match("C1CC[!#6]C1")') = 1






The '*' as query maybe not optimal, but I hope that finding '*' in any given structure is a quickie...

ChemAxon aa7c50abf8

24-08-2005 11:29:53

Forget about my previous posting. That solution is not efficient. The searches in ctFilter will not use fingerprints and will perform graph search on all structures in the table.





Sorry.





You will have to wait for the "real" multiple queries feature.

ChemAxon aa7c50abf8

31-08-2005 15:31:55

Mark,
Quote:
If jc_contains and jc_compare could specify a set of SMILES/SMARTS to match, with a syntax indicating how the match is to be performed, and the match was done all in one index scan, that would be ideal. I don't know if it is possible.





The syntax would need to support AND, OR and NOT (and maybe XOR), with a means of specifying precedence.


A solution of this kind has the disadvantage that you cannot specify multiple filterQueries with multiple query structures packed into a single jc_compare operator. From this perspective, combining SELECT statements with UNION, INTERSECT and MINUS is more flexible. What do you think?





Peter

User f698d0529d

31-08-2005 17:16:58

Peter


If I have understood you correctly, then you are suggesting that the disadvantage is that it would not permit you to perform a query such as





"Select compounds which either


A. Contain benzene and belong to project A


or


B. Contain cyclohexane and belong to project B"





because the ctFilter is ANDed to the filterQuery, and you can only have one ctFilter and one filterQuery per jc_compare clause?





Personally, my application does not support those kind of queries anyway.





However, I would suggest that the ability to specify a multiple search chemical syntax would not be a disadvantage, because if you needed to write these queries, you could revert to using Oracle set operations using two select queries, as you suggest. I would see the ability to do all the search in one select statement, where appropriate, as an advantage in cases where this was not required. Especially if there was a performance improvement.





Just to keep you updated with my progress. I have decided to write all my JChem SQL using the jc_compare function, since it seems to do everything I need for my application. The suggestion you gave, even though it may involve a full table scan, is still sufficiently fast for my purposes.





I will let you know if I encounter any more problems.


Thanks


Mark

ChemAxon aa7c50abf8

01-09-2005 08:42:51

Mark,
Quote:
If I have understood you correctly, then you are suggesting that the disadvantage is that it would not permit you to perform a query such as





"Select compounds which either


A. Contain benzene and belong to project A


or


B. Contain cyclohexane and belong to project B"





because the ctFilter is ANDed to the filterQuery, and you can only have one ctFilter and one filterQuery per jc_compare clause?
Yes, this is what I meant.
Quote:



I would see the ability to do all the search in one select statement, where appropriate, as an advantage in cases where this was not required. Especially if there was a performance improvement.
Yes, multiple query structures with "and-or-syntax" should have a performance advantage over set operations on multiple selects.





Peter

User f698d0529d

01-09-2005 08:45:59

Hello Peter


I just received your email about the new version of JChem. Thank you. I tried to reply by email, but there is a problem at the moment, so I thought I would post here instead.





I just have a query before I install it.





Every time that there is a new version of Jchem, we currently go through the whole installation process, which involves dropping all Jchem functionality from the database, and reinstalling, including rebuilding the indexes.





However, is it possible that some releases, which are minor modifications or fixes, do not require this time consuming process? For example, is it possible that some fixes involve changes to the Jchem streams application only, in which case the upgrade might only involve shutting down tomcat and replacing the jchem streams application?





Thanks


Mark

ChemAxon aa7c50abf8

01-09-2005 09:54:06

Mark,





Theoretically, you do not always necessarily have to reinstall/rebuild everything. Especially so when the PL/SQL code is not changed. One reason why the upgrade process is so painful is that the index type (and the packages it relies on) is (are) defined in PL/SQL and it has (they have) to be dropped and recreated as part of the upgrade. To do so you have to drop everything else depending on these PL/SQL objects.





If the PL/SQL code is not changed from one version to another and the JChem table version is also the same, then it is sufficient to reload the jchem.jar into Oracle and Tomcat. This particular bug fix version (3.0.15), however, required many changes in the PL/SQL code (just do a diff on jchem_cart.sql), so you have to uninstall and recreate everything.





I will research to find out if we can improve the current upgrade procedure.





Peter

User f698d0529d

01-09-2005 10:22:14

Hi Peter.


I am a little confused about which version of JChem I should now install.


When I opened this thread, I was using 3.0.14, which suffered from both the case problem with filterQuery and the problem with incomplete fetching.





You then said that the case problem was fixed in 3.1, and the fetch problem would be fixed in a future release.





So I installed 3.1, figuring that at least I would have one less problem.





But now you say that both problems have been fixed in a release called 3.0.15. If I go ahead and install this, I may solve both problems, but won't I lose the other advantages/fixes that 3.1 has? Why did you not apply the fixes to the 3.1 release instead?





Or am I just confused?





Thanks


Mark

ChemAxon aa7c50abf8

01-09-2005 11:32:31

Hi Mark,





The bug I said had been fixed in 3.1 was also fixed in our 3.0 branch in the repository. Then a newer bug has been fixed both in the 3.0 and the 3.1 branch in our repository. Consequently, when 3.0.15 was released, it contained fixes for both bugs. The same fixes will be contained in 3.1.1 which has yet to be released.





The general idea is that the more recent the release, the more bugs are solved in it. The most bug-free release is 3.0.15 currently.





Note the difference between bug fixes and features. Certain features (like structural keys) are contained only in the 3.1.x series (and upwards).





Peter