I have generated a set of 1024-bit fingerprints with generatemd for a database of molecules. I found that the attached molecules have identical fingerprints but are not completely identical. Is this to be expected?
Yes, that is the correct behaviour as long as the pattern length (-n parameter) is less than 8.
This is because the shortest path that is found in one of your structures but not in the other one consists of 8 consecutive steps.
This follows from the symmetry of naphthalene: one needs at least 6 steps to complete a walk (a path that returns to a node already visited), plus two more steps from the substituent at position 4. Is this explanation clear?
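To illustrate the general effect, here is a minimal Python sketch. It is a toy model, not the algorithm generatemd actually uses, and it does not use your molecules: `label_paths` and `ring` are hypothetical helpers, and the two structures are plain 6- and 7-membered carbon rings. It only shows that two different graphs can yield exactly the same set of paths until the maximum path length is large enough.

```python
def label_paths(atoms, bonds, max_bonds):
    """Collect atom-label sequences along simple paths of at most max_bonds bonds."""
    neighbours = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    found = set()

    def walk(path):
        labels = tuple(atoms[i] for i in path)
        # A path and its reverse describe the same fragment, so keep one form.
        found.add(min(labels, labels[::-1]))
        if len(path) - 1 == max_bonds:
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:  # never revisit an atom
                walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return found


def ring(n):
    """Toy structure: a ring of n carbon atoms (assumed example, not your data)."""
    return ["C"] * n, [(i, (i + 1) % n) for i in range(n)]


# With at most 5 bonds per path the two rings cannot be told apart.
same_at_5 = label_paths(*ring(6), 5) == label_paths(*ring(7), 5)
# Allowing 6 bonds reveals a path that exists only in the larger ring.
same_at_6 = label_paths(*ring(6), 6) == label_paths(*ring(7), 6)
print(same_at_5, same_at_6)
```

In this toy case the path sets coincide up to 5 bonds and differ at 6, in the same way that your structures only differ once paths of 8 steps are enumerated.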
So, if you intend to distinguish these very similar structures, you need to increase the path length to at least 8. This, however, may have an undesired effect on the darkness of the fingerprint: too many bits may be turned on. Fingerprint statistics can help you evaluate the situation (see option -T). If the average number of 1 bits exceeds 60%, you may need to consider increasing the length of the fingerprint to avoid further clashes.
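If you want to check the darkness yourself, the following Python sketch shows the arithmetic. It is not the output of option -T, and it assumes you have already exported each fingerprint as a list of 0/1 values; the function names and the two sample fingerprints are made up for illustration.

```python
def darkness(fingerprint):
    """Fraction of bits set to 1 in one fingerprint (a sequence of 0/1 values)."""
    return sum(fingerprint) / len(fingerprint)


def average_darkness(fingerprints):
    """Mean darkness over a collection of fingerprints."""
    return sum(darkness(fp) for fp in fingerprints) / len(fingerprints)


def too_dark(fingerprints, threshold=0.60):
    """True if the average darkness exceeds the threshold (60% by default)."""
    return average_darkness(fingerprints) > threshold


# Two made-up 1024-bit fingerprints with 700 and 600 bits set.
fp_a = [1] * 700 + [0] * 324
fp_b = [1] * 600 + [0] * 424

if too_dark([fp_a, fp_b]):
    print("Average darkness above 60%: consider a longer fingerprint.")
```

Here the average is (700 + 600) / (2 * 1024), roughly 63%, so this made-up set would call for a longer fingerprint.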
Btw: the mol2 files are not properly imported! Nitrogen is not recognized. Thanks for the indirect bug report.