User 677b9c22ff
23-11-2006 02:59:33
Hi Tim,
I use I-JChem WIN v1.02 (all updates) with java -server.
I tried to import a test DB with 250k SMILES (NCI):
http://fiehnlab.ucdavis.edu/staff/kind/Collector/250k.smi.gz
With the small fingerprint setting 8,1,1 this works perfectly and takes ~5 min.
----------
fp, 8,1,1
Import complete. 250,000 entries processed.
Total time taken was 335s for 250k.smi
----------
However, when I do this with large fingerprints (fp: 64,19,19),
the import stalls at 77,600 for several minutes (or even hours, or forever).
There is enough free space for the import.
Tobias
ChemAxon efa1591b5a
23-11-2006 15:28:04
Hi Tobias,
64,19,19 is an extreme choice of parameters. Such long path lengths (and so many bits per pattern) won't make the fingerprint any better.
In general, one does not need more than 2048 bits as fingerprint length; let's say 4K is a reasonable upper bound.
For path length, 8 is more than enough for most structure sets (including the NCI); consider that rings with more than 6 atoms are much less common than 5- and 6-membered ones.
Regarding the number of bits per pattern: 1 or 2 is enough; above 2 the bits are correlated (not much, but they are...).
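To make the three parameters concrete, here is a minimal sketch of a hashed path fingerprint in Python. It is not JChem's implementation: the function names, the SHA-256 hashing, and the toy graph representation (atom labels plus an adjacency list, bond orders ignored) are all assumptions made for illustration. It only shows where the fingerprint length, the maximum path length, and the number of bits per pattern enter the computation.

```python
import hashlib
from typing import Dict, Iterator, List, Set, Tuple

Adjacency = Dict[int, List[int]]


def enumerate_paths(adj: Adjacency, max_bonds: int) -> Iterator[Tuple[int, ...]]:
    """Yield every simple path (no repeated atom) with 0..max_bonds bonds."""
    def extend(path: List[int]) -> Iterator[Tuple[int, ...]]:
        yield tuple(path)
        if len(path) - 1 == max_bonds:
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:
                path.append(nxt)
                yield from extend(path)
                path.pop()

    for start in adj:
        yield from extend([start])


def path_fingerprint(labels: Dict[int, str], adj: Adjacency,
                     n_bits: int, max_bonds: int,
                     bits_per_pattern: int) -> Set[int]:
    """Return the set of bit positions switched on (hypothetical scheme)."""
    bits: Set[int] = set()
    for path in enumerate_paths(adj, max_bonds):
        forward = "-".join(labels[i] for i in path)
        backward = "-".join(labels[i] for i in reversed(path))
        pattern = min(forward, backward)  # same string for both directions
        for k in range(bits_per_pattern):
            digest = hashlib.sha256(f"{k}:{pattern}".encode()).digest()
            bits.add(int.from_bytes(digest[:8], "big") % n_bits)
    return bits


# Heavy atoms of ethanol, C-C-O, as a toy input.
ethanol_labels = {0: "C", 1: "C", 2: "O"}
ethanol_adj = {0: [1], 1: [0, 2], 2: [1]}
fp = path_fingerprint(ethanol_labels, ethanol_adj,
                      n_bits=2048, max_bonds=8, bits_per_pattern=2)
print(sorted(fp))
```

In this sketch the cost is dominated by the number of paths enumerated, which depends on the maximum path length, and each path then costs one hash per bit per pattern.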
Hope this helps,
Miklos
ChemAxon 9c0afc9aaf
23-11-2006 18:28:05
ChemAxon a3d59b832c
24-11-2006 08:08:09
User 677b9c22ff
25-11-2006 00:53:21
Hi, thanks a lot for the replies.
I just checked it again in JChemManager: it says it stops at 77600 (with fp 64x19x19), but it's actually at 77664, so it's confirmed. I am aware that such large fps are not optimal for DB search.
What I wanted to do was tune the import speed and the fp generation during database import a little, which was not very successful.
Tobias
ChemAxon fa971619eb
25-11-2006 08:27:09
The pause is caused by some wacky molecules in the NCI data set. Generating the fingerprints for these problem structures takes some time.
If you look carefully, you can still see a small pause at these points when you use normal fingerprint settings. But the problem is made much worse by choosing inappropriately long pattern lengths.
We refer to this as the 77K problem, as the most obvious example is at that point in the import!
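A rough way to see why a densely connected structure hurts at long pattern lengths is to count simple paths in a toy graph. The sketch below is an illustration only, under stated assumptions: it uses a 12-vertex icosahedral cage (every vertex bonded to 5 others) as a stand-in for a cage-like molecule, not the actual NCI structure, and it counts directed simple paths rather than reproducing how JChem enumerates patterns. The helper names are hypothetical.

```python
from typing import Dict, List

Adjacency = Dict[int, List[int]]


def icosahedron() -> Adjacency:
    """12-vertex cage: apex 0, upper ring 1-5, lower ring 6-10, apex 11."""
    edges = set()
    for i in range(5):
        upper, upper_next = 1 + i, 1 + (i + 1) % 5
        lower, lower_next = 6 + i, 6 + (i + 1) % 5
        edges |= {(0, upper), (upper, upper_next),
                  (11, lower), (lower, lower_next),
                  (upper, lower), (upper, lower_next)}
    adj: Adjacency = {v: [] for v in range(12)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def chain(n: int) -> Adjacency:
    """Unbranched chain of n atoms."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def count_paths(adj: Adjacency, max_bonds: int) -> List[int]:
    """counts[k] = number of directed simple paths with exactly k bonds."""
    counts = [0] * (max_bonds + 1)

    def walk(node: int, visited: set, depth: int) -> None:
        counts[depth] += 1
        if depth == max_bonds:
            return
        for nxt in adj[node]:
            if nxt not in visited:
                visited.add(nxt)
                walk(nxt, visited, depth + 1)
                visited.remove(nxt)

    for start in adj:
        walk(start, {start}, 0)
    return counts


cage_counts = count_paths(icosahedron(), 6)
chain_counts = count_paths(chain(12), 6)
for k in range(1, 7):
    print(f"{k} bonds: chain={chain_counts[k]:>3}  cage={cage_counts[k]}")
```

In the chain the number of paths shrinks as the length grows, while in the cage it keeps multiplying with each extra bond, which is consistent with the pause getting much worse as the pattern length is raised.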
Tim
User 677b9c22ff
26-11-2006 01:12:45
Hi,
It's one of these boranes. It takes 21 minutes for fingerprint generation. Usually it's 1000 mols/second. See also the
mdgenerate discussion here. The SMILES is:
[H]1B234[H]B256B378B149B%10%11%12[H]B%10%13%14%15%16[H]B%13%17%18B%19%20%21[H]B%19%22%23[H]B%22%24%25B%20%23%26B%17%21%27B%24%26%28B%14%18%27B5%15%25%28%29B67%30B89%11B%12%16%29%30
Tobias
ChemAxon efa1591b5a
27-11-2006 08:47:37
Hi Tobias,
I doubt that setting such extreme parameter values (19 for both the path length and the number of bits per pattern) will optimize the fingerprint under any circumstances.
Typically, the path length is a value between 5 and 8, while the number of bits is between 1 and 3. Values outside these ranges are accepted and are good for experimenting, but don't be surprised if you experience odd behaviour. :-)
If you would like to discuss why the above ranges are the 'normal' ones for fp generation, we are happy to help.
[Extreme parameter values affect most systems' behaviour in an undesired manner and can even make the system collapse. Think about resonance... Complex systems like the human body are more fault tolerant than simple physical ones, like a fingerprint generator ;-) ]
Regards,
Miklos