Question about structure caching

User f698d0529d

22-09-2005 12:59:56

Hi





This is JChem 3.1.1, Java 1.5.0_04, Tomcat 4.1.31 on a Linux machine running Red Hat Enterprise Linux ES release 3 (Taroon Update 5), Kernel 2.4.21-32.0.1.ELsmp on an i686. The Oracle version is 9.2.0.6.





I am looking at making enough memory available for the structure cache. 6 SMILES fields are JChem indexed, in three different schemas, making a total of approximately 7 million SMILES.





This, then, means that according to the approximate method, I need 760M of memory available for structure caching.





My first question relates to the more exact method of calculating the memory requirement. I do not know how to work out the fingerprint size necessary for that calculation. Can you show me an example of the calculation?





My other question is about the expiration of tables from the cache.





Is it true that if one additional structure is inserted, then the entire cache must be rebuilt for that table?





Also, suppose we do not have enough memory to fully cache all 6 indexes, but we know that some indexes are used more heavily than others. Is the caching mechanism intelligent, and/or can we use weights to tell it which indexes are more valuable? I think from reading your site that caching is optional - queries can run without it?





So if only two tables were indexed, A and B, and there was only enough cache memory to cache one or the other, and A was queried much more heavily than B, would there be a way to prevent the occasional B query from expiring the A cache?





Or, if there were three tables involved, and we had enough room for two of them, when it came to replace one in the cache, would it replace the least heavily used one?





These questions are only to help me think about our memory requirements.





Thanks


Mark

ChemAxon 9c0afc9aaf

22-09-2005 14:58:06

Hi,





You can find an empirical formula for calculating the memory needs of caching here:





http://www.chemaxon.com/jchem/FAQ.html#outofmemory





Code:
Memory_need [bytes] = Number_of_molecules * (0.5*Average_smiles_length[characters] + Fingerprint_size[bits]/8 + 13.5)






(This assumes the SMILES will compress to around 1/2 of the size in the cache)





The "Average_smiles_length" refers to the cd_smiles field in JChem Base tables, or in the index tables (in case of indexing non-JChem tables in cartridge).


This column contains standardized structures, so its size may differ slightly from the size of the original SMILES.
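As a rough worked example, the formula above can be written as a small function. This is only a sketch of the arithmetic: the function name is made up, and the average SMILES length of 40 characters and the 512-bit fingerprint are assumed inputs, not measured values from any real table.

```python
def cache_memory_bytes(n_molecules, avg_smiles_chars, fingerprint_bits):
    """Empirical cache estimate in bytes, following the formula quoted above."""
    per_molecule = 0.5 * avg_smiles_chars + fingerprint_bits / 8 + 13.5
    return n_molecules * per_molecule


# Assumed inputs: 7 million structures, 40-character average cd_smiles,
# 512-bit fingerprints. Replace these with values measured on your own tables.
estimate = cache_memory_bytes(7_000_000, avg_smiles_chars=40, fingerprint_bits=512)
print(f"{estimate / 1e6:.1f} MB")  # 682.5 MB
```

The average length of the cd_smiles column has to be measured on the actual index tables for the estimate to mean anything.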
Quote:
Is it true that if one additional structure is inserted, then the entire cache must be rebuilt for that table?


No, the cache is incrementally updated based on logs of the changes.
Quote:
I think from reading your site that caching is optional - queries can run without it?
In case of the JChem Cartridge it is not optional - the cache is always utilized.


With API usage and in our JSP example web application it's optional, but I would not recommend searching big tables without caching.





Currently the cache works the following way:





1. Before every search on a given table, JChem checks whether the cache has to be loaded or updated.


2. If there is not enough memory during the loading / update of the table, JChem starts to throw out other tables from the cache, starting from the least recently searched table.


3. If there is still not enough memory (even the single table cannot fit into the cache), the search will be run without caching (as a last resort).
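The three steps above can be sketched as a toy model. This is not JChem code: the class and method names are invented for illustration, table sizes are passed in by hand, and incremental updates are ignored. It only mimics the "least recently searched" eviction order described in the list.

```python
from collections import OrderedDict


class TableCache:
    """Toy model of the three steps above; all sizes are in MB."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.tables = OrderedDict()  # table name -> size, least recently searched first

    def search(self, table, size_mb):
        """Return 'cached' or 'uncached' for one search on `table`."""
        if table in self.tables:
            self.tables.move_to_end(table)  # step 1: already loaded, mark as most recent
            return "cached"
        # Step 2: throw out other tables, least recently searched first.
        while self.tables and sum(self.tables.values()) + size_mb > self.capacity_mb:
            self.tables.popitem(last=False)
        # Step 3: even alone the table does not fit, so search without caching.
        if size_mb > self.capacity_mb:
            return "uncached"
        self.tables[table] = size_mb
        return "cached"


cache = TableCache(capacity_mb=200)
for name, size in [("A", 80), ("B", 70), ("A", 80), ("C", 90)]:
    cache.search(name, size)
print(list(cache.tables))  # ['A', 'C']
```

In this model the victim is chosen by recency, not by how heavily a table is used, so with room for only one of two tables a single query on B is enough to push A out.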





Of course both #2 and #3 should be avoided.





Your 7 million structures (around 0.7 GB) should not be a problem even for a 32-bit system, where a single JRE can utilize around 2 GB of memory.





Please also note that in the case of the Cartridge, the computational part of the Cartridge (which also stores the cache) can reside on a different machine from the Oracle server, so it doesn't have to "compete" for memory and CPU with Oracle.





Let me know if you have further questions.





Best regards,





Szilard

User f698d0529d

22-09-2005 16:30:27

Szilard


Thank you for your helpful information.





Pragmatically, I am going to increase the memory using the simple approach of 100M per 1 million plus 60M and see if that solves the caching issue.





But, just to clarify, about the formula used to calculate the memory consumption.





I am not familiar with fingerprints. So I don’t know how to work out or look up the fingerprint size in bits. When I create an index, the only parameter I specify is the TABLESPACE, which suggests that the default fingerprint size is used. I can see 16 number(10) columns called cd_fp 1 to 16 in the index table. But that would appear to suggest that the default value for the fp_size parameter is 16. I am guessing fingerprint size in bits refers to the default value of the fp_bit parameter, which I do not know.





Incidentally, I cannot get a create index statement using fp_size parameter to run, maybe I am not following your documentation correctly?





CREATE INDEX JC_IDX_TEST ON MCR_MEM2
(SMILES)
INDEXTYPE IS JCHEM.JC_IDXTYPE
PARAMETERS('TABLESPACE=data_eoaiacd_v15,fp_size:16');





ORA-29855: error occurred in the execution of ODCIINDEXCREATE routine


ORA-29532: Java call terminated by uncaught Java exception: java.lang.IllegalArgumentException: Illegal option: fp_size:16


ORA-06512: at "JCHEM.JCHEM_CORE_PKG", line 0


ORA-06512: at "JCHEM.JC_IDXTYPE_IM", line 17





I have tried variations of it.

ChemAxon aa7c50abf8

22-09-2005 16:34:58

In addition to the memory required by the cache, you of course need memory for computation. See the input field "Amount of memory (in megabytes) reserved for temporary data (not to be included in the structure search cache)" on the web-based administration interface. The value entered here will be reserved and not allocated to the structure cache. The memory area required for these miscellaneous operations increases proportionately with the number of concurrent users. It also depends on the total number of hits concurrently returned. Since the size of this memory area depends heavily on your usage pattern, you have to experiment to find the right value (and at this point you will feel more in the domain of chemistry than that of software technology ;-) ).

ChemAxon 9c0afc9aaf

22-09-2005 16:57:45

Mark,





Each fingerprint column corresponds to 32 bits.


So the default size is 16x32=512 bits.
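In code form, the conversion from fingerprint columns to the Fingerprint_size term of the memory formula is just this (the constant and function names are made up for illustration):

```python
BITS_PER_FP_COLUMN = 32  # each cd_fp column holds one 32-bit integer


def fingerprint_bits(fp_columns):
    """Fingerprint length in bits for a given number of cd_fp columns."""
    return fp_columns * BITS_PER_FP_COLUMN


bits = fingerprint_bits(16)
print(bits, bits // 8)  # 512 64
```

So with the default 16 columns, the fingerprint contributes 64 bytes per molecule to the cache estimate.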





I think you should use "=" in the syntax:





PARAMETERS('TABLESPACE=data_eoaiacd_v15,fp_size=16');





The default fingerprint setting should be fine for drug-like structures.








Szilard

User f698d0529d

23-09-2005 08:49:41

Nope - already tried this one





CREATE INDEX JC_IDX_TEST ON MCR_MEM2
(SMILES)
INDEXTYPE IS JCHEM.JC_IDXTYPE
PARAMETERS('TABLESPACE=data_eoaiacd_v15,fp_size=16');





ORA-29855: error occurred in the execution of ODCIINDEXCREATE routine


ORA-29532: Java call terminated by uncaught Java exception: java.lang.Exception: Invalid usage of the parameters tag in the create index statement.


ORA-06512: at "JCHEM.JCHEM_CORE_PKG", line 0


ORA-06512: at "JCHEM.JC_IDXTYPE_IM", line 17

ChemAxon aa7c50abf8

23-09-2005 10:16:20

The error message should actually read:
Quote:
If FP_SIZE property is defined as parameter FP_BIT and PAT_LENGTH parameters also should be defined.
Please, specify all of the three fingerprint property parameters.





Peter

User f698d0529d

23-09-2005 10:47:23

Yes, that works, although I have no idea what the parameters actually do, apart from increasing or decreasing the number of integer columns in the index table. Not all of the structures in my tables are drug-like.





I have read





http://www.chemaxon.com/jchem/index.html?content=doc/guide/cartridge/index.html





but it does not tell me how to find "good" values for the three parameters for index creation on a specific smiles field.





For the moment, I will just use the defaults.

ChemAxon 9c0afc9aaf

23-09-2005 12:39:18

Hi,





You can find information on the effect of fingerprint parameters at the following URL:





http://www.chemaxon.com/jchem/doc/user/fingerprint.html





Szilard

User f698d0529d

23-09-2005 15:56:22

Szilard





Yes. I had already seen that. I can't say I find it very helpful.


I have also looked at





http://www.daylight.com/dayhtml/doc/theory/theory.finger.html





just now, and this gives me a much clearer (although still very fuzzy) idea of what fingerprints actually are, but again does not tell me what sorts of parameters to use when building them.





Going back a step, the purpose of all this is for me to see if search speed on our databases could be improved by rebuilding the indexes using different parameters, now that I know caching has been addressed. We have 3 databases - only 1 is filled with drug like molecules - the other two contain varying types of molecules.





So I need to make myself happy that the parameters used to build the indexes are likely to be approximately optimal.





Can you help me with choosing appropriate values for the three fingerprint parameters in the index build?





Thanks


Mark

ChemAxon efa1591b5a

26-09-2005 15:53:53

Hi Mark,





generatemd is a tool in the JChem tool set that generates various molecular descriptors, including hashed chemical fingerprints. It is a command-line tool and it supports various command-line options. Among these options, -T is particularly helpful in your case. With the -T flag, generatemd produces statistics about the fingerprints generated. (Unfortunately, neither this nor any similar functionality is available when setting up a JChem database table.)





To obtain the statistics you can either take the original input structure file that you imported into your database, or use the JChem table directly to generate fingerprints for each structure using generatemd. Note that when the database table is used, a separate descriptor table will be created and the original fingerprints in the JChem table are not affected.





generatemd can take fingerprint parameters either from the command line or from an XML configuration file. Type generatemd -x to get help about the available command-line parameters. An example config file, cfp.xml, can be found in the examples/config folder in your JChem installation directory.





So, generate chemical fingerprints with generatemd and specify the -T flag on the command line. When generatemd has processed all input structures, it prints summary statistics that look something like this:





Number of molecules = 1000
Number of bits set:
Average = 17.56%
Maximum = 60.45% (molecule 737)
Minimum = 1.95% (molecule 728)
Density function:
0%-10% 23.70%
10%-20% 40.60%
20%-30% 25.80%
30%-40% 8.00%
40%-50% 1.10%
50%-60% 0.60%
60%-70% 0.20%
70%-80% 0.00%
80%-90% 0.00%
90%-100% 0.00%





Average bit density should be around 50% and should not exceed 60%. The three parameters affect the 'darkness' of the fingerprint differently. When the length parameter (-f) is increased, e.g. from 512 to 1024, fingerprints become sparser. When the maximum path length (-n) or the bit count (-b) is increased, the fingerprint becomes denser.


Larger fingerprint length and longer paths enhance the descriptive power of the fingerprint as long as the 50% density ratio is maintained. Shorter fingerprints cannot always distinguish different chemical structures: their fingerprints may appear similar even if the structures themselves are not (simply put, too much information is squeezed into a bit string that is too short to represent it).


However, when you increase the length of the fingerprint to capture as many of the structural features present in your compounds as possible, the fingerprint generation time, the fingerprint comparison time and the storage space needed all increase. Both time and space scale linearly, though, so even 4096-bit fingerprints are rapidly generated and compared in a database search.
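To make the effect of the three parameters on density more concrete, here is a toy hashed fingerprint. It is not the JChem algorithm: the feature strings merely stand in for enumerated bond paths, SHA-256 is an arbitrary choice of hash, and all names are invented for this sketch.

```python
import hashlib


def toy_fingerprint(features, length_bits, bits_per_feature):
    """Return the set of bit positions switched on by hashing each feature."""
    positions = set()
    for feature in features:
        for k in range(bits_per_feature):
            digest = hashlib.sha256(f"{feature}/{k}".encode()).digest()
            positions.add(int.from_bytes(digest[:8], "big") % length_bits)
    return positions


def density(features, length_bits, bits_per_feature):
    """Fraction of bits set, comparable to the 'Average' figure printed by -T."""
    return len(toy_fingerprint(features, length_bits, bits_per_feature)) / length_bits


# 200 made-up features; a longer maximum path length would produce more of them.
features = [f"path-{i}" for i in range(200)]
for length, bits in [(512, 2), (1024, 2), (512, 3)]:
    print(length, bits, f"{density(features, length, bits):.1%}")
```

In this model, doubling the length can only lower the density, while adding a bit per feature (or adding more features, which is what a longer path length does) can only raise it.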





We found that 1024 bits, paths up to 7 bonds long and 3 bits per feature are optimal parameters in most cases. For the sake of high-speed operation and minimal storage space, 512, 6, 2 is used in JChem Base by default.


As a first step I suggest evaluating and comparing these two sets of values.





I am happy to help you in such experiments, results can be useful for us too.


Please let me know if you need any further assistance.





Some useful resources:


http://www.chemaxon.com/jchem/doc/user/GenerateMD.html


examples/screen/index.html#gener_cfp under the jchem installation directory.





Regards,


Miklos

User f698d0529d

27-09-2005 10:43:21

Miklos


Thank you for your help - I think I am nearly there.


I am a little confused by the generatemd program. We are not using JChem tables, but JChem indexes on regular Oracle tables. Also, I am not sure if we have a license for generatemd (we have a license for the cartridge and a couple of other things, but I don't know about generatemd).





So I figured the best thing is to take a random sample of 1000 SMILES from each indexed table and put it in a file, then run generatemd on this, using command arguments to mimic the index creation parameters used, and see if any of the indexes would benefit from different parameters.





I have prepared the random SMILES files, and I can run generatemd to produce a report like this





Code:



[oracle@uksap22 bin]$ ./generatemd c ~/1000_smiles_ssd_version.txt -o 1000.out -T -k CF -f 1024 -n7 -b 3
License key file not found: /home/oracle/.chemaxon/licenses.dat
Number of molecules = 999
Number of bits set:
    Average =  48.09%
    Maximum =  87.40% (molecule 735)
    Minimum =  11.91% (molecule 627)
Density function:
 0%-10%   0.00%
10%-20%   0.90%
20%-30%   7.41%
30%-40%  20.62%
40%-50%  25.43%
50%-60%  27.93%
60%-70%  13.51%
70%-80%   4.10%
80%-90%   0.10%
90%-100%  0.00%
Cell frequencies:
index    freq      %
  0       319    31.93%
  1       841    84.18%
  2       311    31.13%
  3       529    52.95%
etc








However, as far as I know when building a JChem index on a regular Oracle table, these are the parameters available, which relate to the fingerprint





CREATE INDEX JC_IDX_TEST ON TEST_MCR
(SMILES)
INDEXTYPE IS JCHEM.JC_IDXTYPE
PARAMETERS('TABLESPACE=INDX_SSD,fp_size=16,fp_bit=32,pat_length=2');





But there seems to be one missing, compared to what you discussed above. To me, it seems that the -f (fingerprint length) argument is equivalent to the product (multiplication result) of fp_size and fp_bit and the -n (maximum path length) argument is equivalent to the pat_length parameter. So what is the -b (bit count) argument equivalent to?





For example, if I wanted to see the fingerprint statistics for the index created by the CREATE INDEX statement above, what parameters should I provide to generatemd?





Thanks again.


Mark

ChemAxon efa1591b5a

27-09-2005 12:36:58

Hi Mark,





I reckon it was a good idea to pick 1000 random structures for testing.





The parameters in the create statement are as follows: fp_size is the size of the fingerprint in 32-bit units (i.e. integers), so the length (in bits) is 16 * 32.


The next parameter, fp_bit, is what I called bit count. By default it is 2 in JChem. 32 is too large: it makes the fingerprint black. pat_length is the maximum path length; 6 is the default value in JChem.
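The mapping described above can be sketched as a small helper that builds the matching generatemd argument list. The helper name and the file names are hypothetical; the flags are the ones already used earlier in this thread, and the defaults are the JChem defaults mentioned above (16 columns, 2 bits, path length 6). Whether generatemd accepts a space after -n has not been checked here (the earlier command used -n7), so adjust if needed.

```python
def generatemd_args(input_file, output_file, fp_size=16, fp_bit=2, pat_length=6):
    """Build a generatemd command line mirroring the CREATE INDEX parameters."""
    length_bits = fp_size * 32  # fp_size is counted in 32-bit integers
    return [
        "generatemd", "c", input_file, "-o", output_file, "-T", "-k", "CF",
        "-f", str(length_bits),   # fingerprint length in bits
        "-n", str(pat_length),    # maximum path length
        "-b", str(fp_bit),        # bits set per feature
    ]


print(" ".join(generatemd_args("sample.smi", "sample.out")))
# generatemd c sample.smi -o sample.out -T -k CF -f 512 -n 6 -b 2
```

For the CREATE INDEX statement quoted in the question, the same helper would be called with fp_size=16, fp_bit=32, pat_length=2.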





I believe that the cartridge does not perform that well with these parameter settings. The tests you're doing right now should suggest more feasible values.





Hope this helps.





Regards,


Miklos

User f698d0529d

29-09-2005 16:04:57

Miklos


I have run the generatemd program on 1000 smiles samples from the 6 tables involved, to see how varying the parameters results in different statistics about the fingerprints. I have attached an excel file with the results.





To my thinking (which is very vague about fingerprints), it is not just the average darkness which is the indication of the performance of the fingerprint index. There is also the spread of values (the standard deviation). To me, a narrower spread suggests a less diverse set of molecules, and presumably an index will perform consistently better on such a set. There is also the symmetry of the histogram of the values. To me, a more symmetrical spread is more desirable, as I have some vague idea that this will produce more consistent search performances.





So, based on this, and the data in the excel file attached, I have concluded that the default parameters are near optimal for the two tables in the ssd database, but that all the tables in the esd and mdl databases would benefit by changing the bit parameter from the default of 2 to 3 (ie 512 bits, 6 path length and 3 bit count), as the default parameters make the fingerprints too light, and the 512 6 3 parameters seem the best compromise of darkness, spread and symmetry.





Do you agree? Or do you suggest more investigation?





Thanks


Mark

ChemAxon a3d59b832c

03-10-2005 10:32:57

Custom24 wrote:
Miklos


To my thinking (which is very vague about fingerprints), it is not just the average darkness which is the indication of the performance of the fingerprint index. There is also the spread of values (the standard deviation). To me, a narrower spread suggests a less diverse set of molecules, and presumably an index will perform consistently better on such a set. There is also the symmetry of the histogram of the values. To me, a more symmetrical spread is more desirable, as I have some vague idea that this will produce more consistent search performances.
Mark,





Be careful here: the standard deviation is misleading, because during structure searching the fingerprints are not used as a whole but bit by bit. (Fingerprint pre-screening selects molecules for the slower atom-by-atom search. The selected molecules are those that have all bits set in their fingerprints which are in the query's fingerprint.)





In these hashed fingerprints, a particular bit of the fingerprint may relate to several different structural features. This phenomenon is called a collision. The denser the fingerprint is, the more likely these collisions are. When there are too many collisions, the screening is not efficient and all "denser" structures must be searched in the slower atom-by-atom phase. This means bad search performance.





To avoid the above situation, darker fingerprints should be avoided.
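The screening rule described above (a molecule is selected when every bit of the query fingerprint is also set in its own fingerprint) can be illustrated with plain integers as bit strings. This is a sketch of the principle only, with made-up 4-bit fingerprints and an invented function name, not the cartridge's implementation.

```python
def passes_screen(query_fp, target_fp):
    """True if all bits set in the query are also set in the target."""
    return (query_fp & target_fp) == query_fp


query = 0b0101
print(passes_screen(query, 0b0111))  # True: both query bits are present
print(passes_screen(query, 0b0010))  # False: screened out, no graph search needed
print(passes_screen(query, 0b1111))  # True: a "black" target passes every query
```

The last case shows why dark fingerprints hurt: a target with nearly all bits set can never be screened out, so it always goes on to the atom-by-atom phase.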
Custom24 wrote:
So, based on this, and the data in the excel file attached, I have concluded that the default parameters are near optimal for the two tables in the ssd database, but that all the tables in the esd and mdl databases would benefit by changing the bit parameter from the default of 2 to 3 (ie 512 bits, 6 path length and 3 bit count), as the default parameters make the fingerprints too light, and the 512 6 3 parameters seem the best compromise of darkness, spread and symmetry.





Do you agree? Or do you suggest more investigation?
From the xls results it seems to me that the fingerprints in your ssd tables are too dense, so you may have to consider making them lighter by:
  • Increasing the fingerprint length, and/or
  • Decreasing the pattern length and/or
  • Using 1 bit instead of 2 per feature.
I also recommend running benchmarks on a few random molecules with your typical queries, especially if you are changing the latter two parameters. (They change the amount of information put into the fingerprints.)





esd and mdl tables seem ok with the default parameters.





Please note that there is no such thing as a "too light" fingerprint from the searching point of view, because lighter means fewer collisions and hence better screening efficiency. As for storage space, lighter fingerprints across the whole table mean that a shorter fingerprint could be used with the same efficiency. However, none of your tables seem to show this "too light overall" symptom.





Best regards,





Szabolcs

User f698d0529d

03-10-2005 12:24:44

Szabolcs


Thank you for your reply. I am glad I waited, as my thoughts seem to have been the opposite of what is correct.





As you say, the only way to know for sure is to benchmark. Here are my current thoughts.





1. I don't think the "representative sample" idea will work for benchmarking. Generally, in Oracle, query performance on smaller tables does not necessarily scale to larger tables, as the explain plan often changes, and such things as paging to disk when RAM runs out interfere. However, I don't know much about JChem internals - the explain plan only says that the index is used in any case.





2. Although the ultimate measures of performance are how long the query takes to return the first results, and how long it takes to fetch a cursor over the entire result set, these time measures depend on other activity on the server at the time. Ideally, I suppose we could create several copies of one of the large tables, indexed with different parameters, and start several JChem queries on them simultaneously. Presumably, the winner would represent the best parameters. However, the problem is that I don't have enough RAM (for the structure cache) or tablespace to create several copies of the large table, or its indexes.





So I need a more independent measure of the "time" taken for the query to return. I am not sure what I can do here, as the dual-connection nature of the way JChem works complicates matters. Any ideas?





mark

ChemAxon 9c0afc9aaf

03-10-2005 15:27:23

Hi,





There is no need to measure search times to determine FP performance.





The search is a 2-stage process:





1. Fingerprint screening: this is very rapid. This phase narrows down the list of potential hit structures, but still contains some non-hit structures.





2. Graph search is performed on each screened molecule.


This phase is much slower, the time is proportional to the number of screened structures.





You get the best theoretical FP performance if the number of screened structures equals the number of hits: no CPU time is wasted on the graph search of non-hit structures.





Usually the number of screened structures is quite close to the number of hits, see our benchmark on the NCI dataset for example:





http://www.chemaxon.com/jchem/FAQ.html#benchmark3





As you can see, the efficiency of screening also depends on the query structures used.





If the difference between the number of screened structures and the number of hits is significantly bigger in your case for similar query structures, you may try different FP settings to improve the speed.


(If these numbers are already close, you cannot improve much.)
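A simple way to put a number on this comparison is the ratio of hits to screened structures. The function names and the counts below are hypothetical; in practice both figures would come from your own searches on the same query.

```python
def screening_efficiency(n_screened, n_hits):
    """Fraction of screened structures that turn out to be real hits (1.0 is ideal)."""
    if n_hits > n_screened:
        raise ValueError("hits cannot exceed the number of screened structures")
    if n_screened == 0:
        return 1.0  # nothing passed the screen, so no graph search was wasted
    return n_hits / n_screened


def wasted_graph_searches(n_screened, n_hits):
    """Number of atom-by-atom searches spent on structures that are not hits."""
    return n_screened - n_hits


# Made-up counts for one query: 1200 structures passed screening, 1000 were hits.
print(f"{screening_efficiency(1200, 1000):.1%}", wasted_graph_searches(1200, 1000))
```

Comparing this ratio for the same queries under two fingerprint settings avoids depending on wall-clock timings that vary with server load.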





In the case of the cartridge you can determine the number of screened structures by calling the jc_compare operator with search type "f".





http://www.chemaxon.com/jchem/doc/guide/cartridge/cartapi.html#jc_compare








Best regards,





Szilard