Which Jcart table column to index for SMILES queries?

User 7f33ec9a5c

24-08-2012 01:52:31

 


We are doing some testing of JChem Cartridge and I'm having a really hard time understanding some of the nuances of the cartridge. 


I'm having a difficult time figuring out best practices for indexing jcart tables, and I can't figure out which column of a jchem table is the preferred format to index, and the implications of indexing different data types.  If there is a document on this, could someone point me to it?   Further, I don't understand why there are so many native columns in a jChem Cartrige table and which columns I should be interested in using.



To explicitly describe my problem:



 I have created a JChem table:

exec jchem_table_pkg.create_jctable('structure_jc1', 'jchem.jc_idx_property', 16, 2, 6, 'jc_structure_id number(15)', null, 1, 'tableType:molecules');

Then populated it using SMILES (this is a code snippit, with much omitted):

  for i in (select s_structure_id, s_smiles from structure ) loop
        a:= jchem.jchem_table_pkg.jc_insert(i.s_smiles,'structure_jc1', 'jchem.jc_idx_property', 'true', 'false', 'haltOnBadFormat:y');

 end;


Following the examples at https://www.chemaxon.com/jchem/doc/dev/cartridge/index.html#create_index I did a

execute immediate 'CREATE INDEX jc_idx ON structure_jc1(cd_structure) INDEXTYPE IS jchem.JC_IDXTYPE'


Now if I try and search using SMILES I get the following errors:


select cd_smiles from structure_jc1 where jc_equals(cd_structure,'CCCCCCCCCC[N+](C)(C)CCCS([O-])(=O)=O') = 1

                                          *

ERROR at line 1:

ORA-29900: operator binding does not exist

ORA-06553: PLS-306: wrong number or types of arguments in call to 'JC_EQUALS'



So I went searching around for how to use jc_equals and I see lots of examples for jc_equals(cd_smiles but none for jc_equals(cd_structure.


My suspicion is that the jc_equals function won't work across the BLOB and smiles data types, as advertised in the documentation, and that I should have indexed cd_smiles instead of cd_structure, but I can't find any examples or best practices on this.




…currently, I'm indexing cd_smiles, but it won't be done for another 15 hours….



~mike

 

ChemAxon aa7c50abf8

24-08-2012 10:39:36

Mike,


I'm having a difficult time figuring out best practices for indexing jcart tables, and I can't figure out which column of a jchem table is the preferred format to index, and the implications of indexing different data types.  If there is a document on this, could someone point me to it?   Further, I don't understand why there are so many native columns in a jChem Cartrige table and which columns I should be interested in using. 

I suggest to consider using regular structure tables and much of this confusion will vanish -- like the example in this post: https://www.chemaxon.com/forum/viewpost43025.html#43025 . The example in this post uses default settings for the index including the fingerprint settings. For custom fingerprint settings create the JChem index with the fp_size, fp_bit and pat_length index parameters:


create index jcxwombat on wombat(structure) indextype is jchem.jc_idxtype
parameters('fp_size=16,fp_bit=2,path_length=6');

ORA-29900: operator binding does not exist

We provide (VARCHAR2, VARCHAR2), (BLOB, BLOB) and (CLOB, CLOB) bindings for JC_EQUALS (, JC_CONTAINS and a couple of other operators) on the assumption that the target and query type will be same. Casting the query to the type of the target should help:


select cd_smiles from structure_jc1 where jc_equals(cd_structure, to_blob('CCCCCCCCCC[N+](C)(C)CCCS([O-])(=O)=O')) = 1;

In most cases, you can avoid having binding issues by using JC_COMPARE with which we provide more bindings:


select cd_smiles from structure_jc1 where jc_compare(cd_structure, 'CCCCCCCCCC[N+](C)(C)CCCS([O-])(=O)=O', 't:d') = 1;

Peter

ChemAxon aa7c50abf8

24-08-2012 11:12:30

A little background on the two table types:



Slide #20 and #21 of the JChem Cartridge (JCC)  technical presentation provides a brief overview of the two table types (regular structure tables that I am suggesting to consider using and JChem Base (JCB) structure tables you started out with): http://www.chemaxon.com/conf/JChem_Cartridge.ppt.



The two table types support the same "chemical" functionality, but there is some minor "non-chemical" differences between the two ("-" denoting disadvantage, "+" advantage for JCB tables):



- Oracle designed its Data Cartridge architecture with base tables (the regular structure table) physically separated from the associated index data tables (the _JCX tables) in mind. In contrast, JChem Base structure tables include the user data (the chemical structures in their original forms plus [optionally] other user data in additional, user-defined columns) and "index data" (data generated during import to make search faster) in the same physical table. One consequence of this is that we have to provide (and users have to use) wrapper procedures around SQLs to avoid violating Oracle constraints on DMLs and queries competing in the same SQL call (such as during INSERTs into duplicate filtered column).

- JCB tables "leak" data (more than regular structure tables) to users used internally for (mainly) searching (CD_SMILES column). This sometimes confuses users -- in particular when it comes to understanding the role/effect of the table's Standardizer configuration. This leakage also makes development/changes difficult, because users get to rely on the table structure.

+ JCB doesn't currently support base-table/index-table pair configurations. (There are hazy plans to provide this kind of support, though.) The advantage is thus on JCB-tables' side for users who want to use both JCB and JCC applications. In addition, even for users of only JCC, the only option for a CXN-supported client application to import structures from operating-system files into structure tables is JChem Manager, a JCB tool currently not supporting base-table/index-table pairs. (Well, you can also use IJC in principle, but (1) it is sold separately from JChem, (2) it is GUI only, (3) you can't index an already populated structure column with it -- which is, I believe, would be the most important capability in this context.)

+ Searching in JCB doesn't have to join base tables and index tables -- a potential performance advantage.

+ Importing with JCB tables is faster than inserting into JChem-indexed regular structure tables.

- Indexing regular structure tables is faster than the initial import into JCB tables

- Support for administering JCB tables through SQL is very limited in JCC on the principle that JCB tables are inferior "by design" in the context of Oracle Cartridge, so the motivation for their use must come from users' need for JCB-based applications -- for which JCB's own set of supporting APIs and tools are available. You have to use JCB-tools (or API), JChemManager in particular, to do basically everything apart from DMLs and SELECTs.



Peter

User 7f33ec9a5c

24-08-2012 17:06:59

When I approach a new development task, I like to sit down and read the background documentation first.  All of Chemaxon's support staff is very knowlageable, and VERY VERY helpful and responsive, but I don't like having to do work by trial/error/forumpost/response/trial....


Where is the comprehensive,correct documentation for the cartridge?


I thought it would be at: https://www.chemaxon.com/jchem/doc/dev/cartridge/index.html, so I read this thouroughly,


But now I am finding out that :


1). The document on the structure caching mechanism at http://www.chemaxon.com/jchem/doc/dev/cartridge/index.html was incorrect and out of date, and I had to get Szilard's input  to understand how the caching is refreshed.


2).  the example for how to build a correct jChem table is at  https://www.chemaxon.com/forum/viewpost43025.html#43025 style="font-size: 13px; line-height: 17px; text-align: -webkit-auto;"> , not in the docs.


3). "Slide #20 and #21 of the JChem Cartridge (JCC)  technical presentation http://www.chemaxon.com/conf/JChem_Cartridge.ppt." Is where I need to go to find the preferred table structure, even though reading the documenation at http://www.chemaxon.com/jchem/doc/dev/cartridge lead me to the conclusion that JChem Base structure tables were the best way to create a JChem Table.


4). I could not get jc_equals to work because I did not have Peter's very important note, that the query type must match the target type.  Where is that in the docs?  I know it specifies that the functions are overloaded, but I don't see where it says they are only overloaded for matching types.


5). clearly after Peter's post, jc_compare() is the only search function we ever want to use, now that I look at other examples, I see this, but on an initial reading of the documentation, It's not so clear that jc_compare() is the best way to go, and I'd even ask why publish all those other search functions when jc_compare() does it all, AND is overloaded to handle multiple types.


JChem Cartrige has LOTS of options, and LOTS of possible functionality, two table types, multiple columns that could be indexed, and multiple functions to do an equivalent search, if you multiply all the possible options, there are hundereds of different ways to do the exact same thing.  How do I navigate through all these possible options and solutions to find the solution that is best for my implementation, when I can't find the documentation to understand all the possibilities? 

User 7f33ec9a5c

24-08-2012 17:26:50

Hi Guys,


I need a quick-start best practices guide for the following:


I have a chemical structure table of approximately 16 Million unique, distinct "molecules", represented as SMILES, Varchar(4000) no "reactions", no "queries", no "markush" structures, just small molecules. This table contains only a integer primary key and SMILES.


I would like to index these SMILES with a JChem Cartrige index, so that we can perform indexed substructure, tanimoto, superstructure, exact... matches against the table, both on the Absolute and Canonical (no sterio) smiles.


I would like to construct the indexing, so the results are similar to Daycart search results with the Daylight default FP.


The target for the searches will currently be SMILES or SMARTS, but later we'd like to be able to use chemaxon format for more complex matching.


I am happy to either build a new table, or just index the existing table.


Could I please get examples for the following:


Your reccomendation for building and indexing the structure table (or just indexing it if you reccomend using the existing table).


Some example searches against the table.



Thank you,


~mike


 

User 7f33ec9a5c

24-08-2012 18:37:15


SQL> select cd_smiles from structure_jc1 where jc_equals(cd_structure, to_blob('CCCCCCCCCC[N+](C)(C)CCCS([O-])(=O)=O')) = 1;


select cd_smiles from structure_jc1 where jc_equals(cd_structure, to_blob('CCCCCCCCCC[N+](C)(C)CCCS([O-])(=O)=O')) = 1


                                                                                    *


ERROR at line 1:


ORA-29902: error in executing ODCIIndexStart() routine


ORA-01465: invalid hex number


ChemAxon aa7c50abf8

24-08-2012 18:42:05

I am sorry, this must work:


select cd_smiles from structure_jc1 where jc_equals(cd_structure, utl_raw.cast_to_raw('CCCCCCCCCC[N+](C)(C)CCCS([O-])(=O)=O')) = 1;


(Do you really want to work the JCB tables, BLOBs and stuff?)


P.

ChemAxon 61b4fee994

31-08-2012 09:09:32

Hi Mike,


Your input about the documentation  started a long conversation and change of mails between JChem Cartridge developer and support people, it was very useful for us! We are going to improve are documentation based on these.


The first step we have taken is about your #4 and #5 comments:


We deprecate jc_equals and jc_contains in version 5.11 in favor of jc_compare. The only thing why we had jc_equals and jc_contains is the evolution of JChem Cartridge: fist only these features were available through Cartridge and jc_compare came only later with more funcionality. So we will advise the use of jc_compare in the documentation as you suggested.


Regards,


Tamas