Similarity Search: 100% similarity - ChemAxon Forum Archive

User 3c2507de9f

06-10-2005 18:05:41

We are using MarvinSketch 3.5.5 and version 3.013 of JChem. I recently ran a similarity search on 4-tert-butyltoluene [Cc1ccc(cc1)C(C)(C)C] and got 6 hits from our database. Fortunately the query compound itself [Cc1ccc(cc1)C(C)(C)C] was among them, but so were 5 others that are really not 100% similar:

CC(C)c1ccc(C)cc1

CC(C)(C)c1ccc(cc1)C(C)(C)C

CCc1ccc(CC)cc1

CC(C)c1ccc(cc1)C(C)C

CCc1ccc(cc1)C(C)(C)C

Is the search supposed to behave in this manner, or is something wrong in the way we set up the code for the similarity searches?

ChemAxon a3d59b832c

07-10-2005 07:40:02

Hi Dave,

I know it looks a bit weird, but I think it is correct that all of these structures are 100% similar. Explanation:

1. In a similarity search, it is not the molecules themselves but their fingerprints that are compared. In this case the fingerprints of all these molecules are identical, so the similarity search naturally reports them as 100% similar.

2. You may be interested in what information is stored in the fingerprint; this explains why the fingerprints are the same:

All linear paths in the molecular graph are examined up to a maximum length (a fingerprint parameter). Each of these paths sets a few bits in the fingerprint, and the positions of those bits are determined by a hashing algorithm that uses the atom and bond types along the path.
The same is done for all rings up to a certain size limit.

Because all of the above molecules contain exactly the same rings and paths, their fingerprints will be equal. (The molecules do differ in how many times some paths occur, but that makes no difference to the fingerprint.)
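To make the argument concrete, here is a minimal Python sketch of the idea, not JChem's actual implementation. It assumes that atoms are typed by element only, that bonds are typed as single ("-") or aromatic (":"), and that path length is counted in bonds; all three are assumptions made for illustration. The helper names (para_benzene, path_patterns, tanimoto, similarities) are hypothetical. Hashing and ring patterns are left out: every molecule here has the same single benzene ring, and identical pattern sets give identical fingerprints whatever hash is used.

```python
def para_benzene(methyls_a, methyls_b):
    """Adjacency list of a para-disubstituted benzene (carbon atoms only).

    Each substituent is one carbon bearing 0-3 methyl groups:
    0 = methyl, 1 = ethyl, 2 = isopropyl, 3 = tert-butyl.
    """
    adj = {}

    def bond(a, b, kind):
        adj.setdefault(a, []).append((b, kind))
        adj.setdefault(b, []).append((a, kind))

    for i in range(6):                      # aromatic ring, atoms 0..5
        bond(i, (i + 1) % 6, ":")
    next_atom = 6
    for ring_atom, methyls in ((0, methyls_a), (3, methyls_b)):
        root = next_atom                    # carbon attached to the ring
        next_atom += 1
        bond(ring_atom, root, "-")
        for _ in range(methyls):            # methyl groups on that carbon
            bond(root, next_atom, "-")
            next_atom += 1
    return adj


def path_patterns(adj, max_bonds):
    """Set of distinct linear paths with 1..max_bonds bonds.

    A set is used on purpose: how often a path occurs is not recorded.
    """
    patterns = set()

    def walk(atom, visited, bonds):
        if bonds:
            key = "C" + "".join(kind + "C" for kind in bonds)
            patterns.add(min(key, key[::-1]))   # same path read either way
        if len(bonds) == max_bonds:
            return
        for neighbour, kind in adj[atom]:
            if neighbour not in visited:
                walk(neighbour, visited | {neighbour}, bonds + [kind])

    for start in adj:
        walk(start, {start}, [])
    return patterns


def tanimoto(a, b):
    """Tanimoto coefficient of two pattern sets."""
    return len(a & b) / len(a | b)


QUERY = para_benzene(0, 3)                  # 4-tert-butyltoluene
HITS = {
    "CC(C)c1ccc(C)cc1": para_benzene(2, 0),
    "CC(C)(C)c1ccc(cc1)C(C)(C)C": para_benzene(3, 3),
    "CCc1ccc(CC)cc1": para_benzene(1, 1),
    "CC(C)c1ccc(cc1)C(C)C": para_benzene(2, 2),
    "CCc1ccc(cc1)C(C)(C)C": para_benzene(1, 3),
}


def similarities(max_bonds):
    """Similarity of every hit to the query for a given maximum path length."""
    query = path_patterns(QUERY, max_bonds)
    return {smiles: tanimoto(query, path_patterns(graph, max_bonds))
            for smiles, graph in HITS.items()}


for smiles, score in similarities(6).items():
    print(f"{score:.2f}  {smiles}")
```

Under these assumptions all five hits score 1.00 against the query at a maximum length of 6. Calling similarities(7) in the same toy model drops four of them below 1.00, while CC(C)c1ccc(C)cc1 still shares every path with the query.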

By the way, because the fingerprint is hashed, even very different molecules containing very different patterns (paths and rings) may end up with the same fingerprint. However, this is very unlikely.

For more information, see:

http://www.chemaxon.com/jchem/doc/user/fingerprint.html

Best regards

Szabolcs

ChemAxon a3d59b832c

07-10-2005 10:19:48

One more comment on the starting molecules:

In general, it is also very unlikely that different molecules contain exactly the same patterns. This is especially true when heteroatoms are also involved.

You can also increase the selectivity of the fingerprint by changing the fingerprint parameters. In this case, increasing the maximum pattern length from the default 6 to 7 or 8 might be beneficial. However, you should avoid making the fingerprints too dark, that is, setting too large a fraction of their bits. See the discussion in the forum topic below for advice on fingerprint tuning:
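As a rough illustration of darkness, the sketch below estimates the expected fraction of set bits under the simplifying assumption that every pattern sets its bits at independent, uniformly random positions. The function name expected_darkness and the parameter values (pattern counts, 2 bits per pattern, fingerprint lengths) are illustrative assumptions, not JChem defaults or measurements.

```python
def expected_darkness(n_patterns, bits_per_pattern, fp_length):
    """Expected fraction of bits set in a hashed fingerprint.

    Assumes each pattern sets `bits_per_pattern` bits whose positions are
    independent and uniform over `fp_length` (an idealised hash).
    """
    bit_writes = n_patterns * bits_per_pattern
    # Probability that one given bit is never hit, subtracted from 1.
    return 1.0 - (1.0 - 1.0 / fp_length) ** bit_writes


# More patterns (e.g. from a longer maximum path length) darken the
# fingerprint; a longer fingerprint lightens it again.
for fp_length in (512, 1024, 2048):
    for n_patterns in (50, 200, 800):
        darkness = expected_darkness(n_patterns, 2, fp_length)
        print(f"length={fp_length:5d}  patterns={n_patterns:4d}  "
              f"darkness={darkness:.2f}")
```

The trend is the point here rather than the exact numbers: raising the maximum pattern length adds patterns and therefore set bits, so it is worth checking the resulting darkness on your own structures.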

http://www.chemaxon.com/forum/ftopic905.html

(If you find the size of the discussion overwhelming, search for "generatemd" on the page.)

Best regards,

Szabolcs

ChemAxon efa1591b5a

07-10-2005 12:33:57

Hi Dave,

In my opinion these hits are correct. Consider that you performed a similarity search using a fingerprint that was optimized for structure search.

For similarity searching one can use longer fingerprints, 1024 bits or more. These longer fingerprints can store more information, so they can represent small structural differences better.

However, in your particular case I don't think this would help: your structures are very similar.

I would rather use the BCUT descriptor family to better distinguish between these similar structures. I got the distance matrix below:

0.00 0.02 0.02 0.03 0.01 0.00
0.02 0.00 0.03 0.01 0.01 0.02
0.02 0.03 0.00 0.04 0.02 0.01
0.03 0.01 0.04 0.00 0.02 0.03
0.01 0.01 0.02 0.02 0.00 0.01
0.00 0.02 0.01 0.03 0.01 0.00

As you can see, BCUT is more sensitive to minor differences in structures.
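For anyone who wants to read such a matrix programmatically, here is a small Python sketch that loads the values quoted above, checks that the matrix is symmetric, and lists which pairs are separated and which are not. The post does not say which row belongs to which structure, so the sketch refers to rows by number only; the variable names are hypothetical.

```python
# Distance matrix copied from the post above (values rounded to 2 decimals).
D = [
    [0.00, 0.02, 0.02, 0.03, 0.01, 0.00],
    [0.02, 0.00, 0.03, 0.01, 0.01, 0.02],
    [0.02, 0.03, 0.00, 0.04, 0.02, 0.01],
    [0.03, 0.01, 0.04, 0.00, 0.02, 0.03],
    [0.01, 0.01, 0.02, 0.02, 0.00, 0.01],
    [0.00, 0.02, 0.01, 0.03, 0.01, 0.00],
]
n = len(D)

# A distance matrix should be symmetric with a zero diagonal.
is_symmetric = all(D[i][j] == D[j][i] for i in range(n) for j in range(n))
zero_diagonal = all(D[i][i] == 0.0 for i in range(n))

# Upper-triangle pairs as (distance, row, column), using 1-based row numbers.
pairs = sorted((D[i][j], i + 1, j + 1)
               for i in range(n) for j in range(i + 1, n))

unresolved = [(i, j) for d, i, j in pairs if d == 0.0]
widest = pairs[-1]

print("symmetric:", is_symmetric, "zero diagonal:", zero_diagonal)
print("pairs still at distance 0.00:", unresolved)
print("largest distance:", widest)
```

At the two decimals shown, rows 1 and 6 are the only pair still at distance 0.00, and the largest distance (0.04) is between rows 3 and 4.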

What do you think? Is BCUT a good alternative in solving your problem?

Kind regards,

Miklos

ChemAxon efa1591b5a

07-10-2005 13:06:30

BTW the easiest way to calculate the BCUT values for a set of structures is to use the screenmd command in the Screen package:

screenmd mols.smi mols.smi -k BCUT

This command generates the distance matrix seen in my previous post. The screenmd command also works for database structures and for molecular descriptors stored in database tables. Follow this link for more details: http://www.chemaxon.com/jchem/doc/user/ScreenMD.html#usage

Also note that in the unlikely case that BCUT does not give useful results, you can try other molecular descriptors. Type generatemd -L to get a list of available descriptors.

If you find that none of these descriptors provides satisfactory dissimilarity ratios, you can implement your own descriptor or use a third-party molecular descriptor. These can be easily integrated into the Screen package. A worked example is found in the Screen developers' guide: http://www.chemaxon.com/jchem/doc/guide/screen/index.html

Finally, a custom fingerprint implementation kindly contributed by a JChem user is found here:

http://www.chemaxon.com/forum/ftopic352.html

Hope this helps.

Miklos