Performance degradation with large databases.
I created a DB with more than 300,000 compounds and IJC is extremely, painfully slow when a simple substructure search is done.
In the specs you mention that ORACLE is recommended for large databases.
Will this always be the case? I'm not a DB guru so I don't know how difficult it'll be to implement ORACLE on my PC (and how much that would cost me).
Could you give me some advice on this?
The current IJC uses the HSQL database as its local database (i.e. the one that is used by default). Whilst great for small datasets, HSQL does have problems when handling large amounts of data (e.g. over 100,000 structures).
In the new version about to be released we have switched to a new local database (Derby) which resolves most of these problems. We have run with over 1 million structures in Derby. Old projects using HSQL will still run, but you are recommended to start a new project so that you get the benefits of these changes.
Of course the bigger the database the bigger the potential problems, and running everything (Instant JChem, JChem and the database) inside the application will always be a challenge for very large databases. This is why we recommend using an external database if you are using very large databases. IJC supports Oracle and MySQL as external databases.
MySQL is free of charge, and the XE version of Oracle can also be used free of charge (but does have some limitations in terms of database size). Note that you will need an Instant JChem license to use a remote database.
Also, for very large databases (over approx 250,000 structures) you will need to increase the amount of memory available to IJC. This is described in the help.
And one final point of note, the first structure search will be slower than subsequent ones, as the JChem structure searching cache needs to be set up, and the time this takes is approximately linear with relation to database size. For large databases (e.g. over 100,000 structures) this startup delay is noticeable during the first search, but subsequent searches should be very fast.
I'll wait to see/use the very-very-soon-to-be released version to try it with large databases.
Now, interesting comment about the first search being slow since the search cache needs to be created.
Question: will this be the case every time IJC is run for the first time and a search is run? In other words, will the cache have to be recreated?
I am really looking forward to seeing the next version! :)
When IJC closes the cache is lost and needs to be re-created from the database next time it is needed. The cache stores the chemical fingerprint information and allows rapid searching. This is the case for any JChem searching, not just in Instant JChem.
For "normal" databases this is hardly noticeable. But for large ones you will notice a delay for the first search.
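To illustrate why a fingerprint cache makes searching fast, here is a minimal sketch of fingerprint screening in plain Java. It is not JChem's actual implementation: the class name FingerprintScreen, the helper bits() and the tiny hand-made fingerprints are all hypothetical. The general idea is that a target can only contain the query as a substructure if every bit set in the query fingerprint is also set in the target fingerprint, so most structures can be rejected with a cheap in-memory bit test before any real atom-by-atom matching is attempted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

public class FingerprintScreen {

    // True if every bit set in the query is also set in the target,
    // i.e. the target cannot be ruled out as a substructure match.
    public static boolean mayContain(BitSet target, BitSet query) {
        BitSet missing = (BitSet) query.clone();
        missing.andNot(target);   // keep only query bits the target lacks
        return missing.isEmpty();
    }

    // Scans the in-memory cache and returns the indexes of the
    // structures that survive the screen (candidates for full matching).
    public static List<Integer> screen(List<BitSet> cache, BitSet query) {
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < cache.size(); i++) {
            if (mayContain(cache.get(i), query)) {
                candidates.add(i);
            }
        }
        return candidates;
    }

    // Hypothetical helper: builds a fingerprint from bit positions.
    public static BitSet bits(int... positions) {
        BitSet fp = new BitSet();
        for (int p : positions) {
            fp.set(p);
        }
        return fp;
    }

    public static void main(String[] args) {
        // Made-up fingerprints standing in for three cached structures.
        List<BitSet> cache = Arrays.asList(bits(1, 4, 9), bits(1, 9), bits(2, 4));
        System.out.println(screen(cache, bits(1, 9))); // prints [0, 1]
    }
}
```

Because a screen like this runs over data held in memory, it has to be rebuilt from the database after a restart, which is consistent with the delay on the first search described above.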
No worries about cache reload time.
Loading the cache for 250K structures (NCI database) takes only 8s on my machine from a local Derby database.
Win XP, 1GB RAM, 2GHz Pentium mobile proc.
400MB max. heap space is allowed for Instant JChem.
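If the cache load time is roughly linear in database size, as stated earlier in the thread, a single measurement like the one above can be scaled to guess the delay for other database sizes. The sketch below only does that arithmetic. The class and method names are made up for illustration, and the result is a rough extrapolation from one machine, not a benchmark.

```java
public class CacheLoadEstimate {

    // Scales one measured load time linearly to another database size.
    public static double estimateSeconds(long structures,
                                         long measuredStructures,
                                         double measuredSeconds) {
        return measuredSeconds * structures / (double) measuredStructures;
    }

    public static void main(String[] args) {
        // Reference point from this thread: 250,000 structures in 8 s.
        System.out.println(estimateSeconds(1_000_000, 250_000, 8.0)); // prints 32.0
    }
}
```

On that assumption, a database of 1 million structures would take around half a minute for the first search on comparable hardware, provided enough heap is available.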
The Derby DB seems to do an impressive job on performance.
I'll be waiting for the announcement of the next release. ;)