I-JCHEM 1.02 sleeps during import with large FP sizes

User 677b9c22ff

23-11-2006 02:59:33

Hi Tim,


I use I-JChem WIN v1.02 (all updates) with java -server.


I tried to import a test DB with 250k SMILES (NCI):


http://fiehnlab.ucdavis.edu/staff/kind/Collector/250k.smi.gz


With small fingerprint settings (8,1,1) this works perfectly and takes ~5 min.





----------


fp, 8,1,1


Import complete. 250,000 entries processed.


Total time taken was 335s for 250k.smi


----------


However, with large fingerprints (fp: 64,19,19) the import stops and sleeps at 77,600 for several minutes (or even hours, or forever).


There is enough free disk space for the import.





Tobias

ChemAxon efa1591b5a

23-11-2006 15:28:04

Hi Tobias,





64,19,19 is an extreme choice for parameters. Such long path lengths (and number of bits per pattern) won't make the fingerprint any better.





In general, one does not need more than 2048 bits as fingerprint length; let's say 4K is a reasonable upper bound.





For path length, 8 is more than enough for most structure sets (including the NCI); consider that rings with more than 6 atoms are much less common than 5- and 6-membered ones.





Regarding the number of bits per pattern: 1 or 2 is enough; above 2 the bits are correlated (not much, but they are...).
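
To make the three settings concrete, here is a toy hashed path fingerprint in Python. It is only a sketch and not ChemAxon's actual algorithm: the function name `path_fingerprint`, the atom/bond representation and the SHA-256 hashing are all assumptions made for illustration. It shows where fingerprint length (in bits), maximum path length (in bonds) and bits per pattern enter the calculation.

```python
import hashlib


def path_fingerprint(atoms, bonds, n_bits=512, max_bonds=6, bits_per_pattern=2):
    """Toy hashed path fingerprint (illustrative only, not JChem's algorithm).

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    Returns the set of bit positions that are switched on.
    """
    adj = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)

    patterns = set()

    def walk(path):
        # Record the path in a direction-independent form.
        forward = "-".join(atoms[i] for i in path)
        backward = "-".join(atoms[i] for i in reversed(path))
        patterns.add(min(forward, backward))
        # Extend the path only while it has fewer than max_bonds bonds.
        if len(path) - 1 < max_bonds:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in adj:
        walk([start])

    on_bits = set()
    for pattern in patterns:
        # Each pattern switches on up to bits_per_pattern positions.
        for k in range(bits_per_pattern):
            digest = hashlib.sha256(f"{k}:{pattern}".encode()).digest()
            on_bits.add(int.from_bytes(digest[:8], "big") % n_bits)
    return on_bits


# Heavy atoms of ethanol: C-C-O
fp = path_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)],
                      n_bits=512, max_bonds=8, bits_per_pattern=2)
print(len(fp), "bits set out of 512")
```

Raising `max_bonds` adds more patterns to enumerate, and raising `bits_per_pattern` sets more bits for each of them, so both settings increase the work per structure and fill up a fixed-length fingerprint faster.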





Hope this helps,


Miklos

ChemAxon 9c0afc9aaf

23-11-2006 18:28:05

Hi,





Some documentation on fingerprints:





http://www.chemaxon.com/jchem/doc/user/fingerprint.html





http://www.chemaxon.com/jchem/doc/user/RFp.html





Best regards,





Szilard

ChemAxon a3d59b832c

24-11-2006 08:08:09

You can also check out this presentation from the last user group meeting: http://www.chemaxon.com/forum/viewpost6491.html#6491





It discusses how to optimise fingerprints.

User 677b9c22ff

25-11-2006 00:53:21

Hi, thanks a lot for the replies.


I just checked it again in JChemManager. It says it stops at 77600 (with fp 64x19x19), but it's actually at 77664, so it's confirmed. I am aware that such large fps are not optimal for DB search.


What I wanted to do was tune the import speed and the fp generation during database import a little, which was not very successful.


Tobias

ChemAxon fa971619eb

25-11-2006 08:27:09

The pause is caused by some wacky molecules in the NCI data set. Generating the fingerprints for these problem structures takes some time.


If you look carefully you can still see a small pause at these points when you use normal fingerprint settings. But the problem is made much worse by choosing inappropriately long pattern lengths.


We refer to this as the 77K problem as the most obvious example is at that point in the import!
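
A rough way to see why cage-like structures are so much slower than ordinary rings is to count the paths a path-based fingerprint would have to enumerate. The sketch below is an illustration under stated assumptions, not JChem code: it uses a 12-vertex icosahedral cage as a hypothetical stand-in for a borane cluster and compares it with a six-membered ring. The function names are made up for this example.

```python
def icosahedron():
    """Adjacency sets for a 12-vertex icosahedral cage (every vertex has degree 5)."""
    adj = {v: set() for v in range(12)}

    def bond(a, b):
        adj[a].add(b)
        adj[b].add(a)

    for i in range(5):
        upper, lower = 1 + i, 6 + i
        bond(0, upper)                    # top cap
        bond(11, lower)                   # bottom cap
        bond(upper, 1 + (i + 1) % 5)      # upper five-membered ring
        bond(lower, 6 + (i + 1) % 5)      # lower five-membered ring
        bond(upper, lower)                # two links between the rings
        bond(upper, 6 + (i + 1) % 5)
    return adj


def ring(n):
    """Adjacency sets for a simple n-membered ring."""
    return {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}


def count_paths(adj, max_bonds):
    """Count simple paths with 1..max_bonds bonds, each counted once."""
    total = 0

    def walk(atom, visited, bonds):
        nonlocal total
        for nxt in adj[atom]:
            if nxt in visited:
                continue
            total += 1
            if bonds + 1 < max_bonds:
                walk(nxt, visited | {nxt}, bonds + 1)

    for start in adj:
        walk(start, {start}, 0)
    return total // 2  # every path was found once from each end


for max_bonds in (2, 4, 6):
    print(max_bonds, count_paths(ring(6), max_bonds),
          count_paths(icosahedron(), max_bonds))
```

In the ring the count stops growing once the path has gone all the way round, while in the cage it keeps multiplying with every extra bond, which is the behaviour that long pattern lengths amplify.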





Tim

User 677b9c22ff

26-11-2006 01:12:45

Hi,


It's one of these boranes. It takes 21 minutes for fingerprint generation. Usually it's 1000 mols/second. See also the mdgenerate discussion here. The SMILES is:


[H]1B234[H]B256B378B149B%10%11%12[H]B%10%13%14%15%16[H]B%13%17%18B%19%20%21[H]B%19%22%23[H]B%22%24%25B%20%23%26B%17%21%27B%24%26%28B%14%18%27B5%15%25%28%29B67%30B89%11B%12%16%29%30
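
For anyone who wants to track down such records in their own files, timing each structure separately is one simple approach. The helper below is a hypothetical sketch: `find_slow_records` and its `process` callback are made-up names, and `process` stands in for whatever fingerprint or descriptor call you actually use.

```python
import time


def find_slow_records(records, process, threshold_s=1.0, clock=time.perf_counter):
    """Run process(record) on every record and report the slow ones.

    Returns a list of (record_number, record, elapsed_seconds) for each
    record whose processing took longer than threshold_s. Record numbers
    start at 1 so they match line numbers in a one-structure-per-line file.
    """
    slow = []
    for number, record in enumerate(records, start=1):
        started = clock()
        process(record)
        elapsed = clock() - started
        if elapsed > threshold_s:
            slow.append((number, record, elapsed))
    return slow
```

The `clock` argument exists only so the helper can be exercised with a fake timer; in normal use the default is fine.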








Tobias

ChemAxon efa1591b5a

27-11-2006 08:47:37

Hi Tobias,





I doubt that setting such extreme parameter values (19 for both the path length and the number of bits/pattern) will optimize the fingerprint under any circumstances.


Typically, path length is a value between 5 and 8, while the number of bits is between 1 and 3. Values outside these ranges are accepted and are good for experimenting, but don't be surprised if you experience odd behaviour. :-)





If you would like to discuss why the above ranges are the 'normal' ones for Fp generation, we are happy to help.





[Extreme parameter values affect most systems' behaviour in an undesired manner, and can even make the system collapse. Think about resonance... Complex systems like the human body are more fault tolerant than simple physical ones, like a fingerprint generator ;-) ]





Regards,


Miklos

User 677b9c22ff

27-11-2006 09:18:36

Agreed!


Tobias