# Need to compute a similarity matrix for 250,000 compounds

User f28a846261

09-10-2010 16:11:57

We have about 250,000 small drug-like molecules in BindingDB (www.bindingdb.org) and would
like to make a similarity matrix for them.  This would either be an
all-by-all numerical similarity matrix with a Tanimoto in each box; or
it could be all 0s and 1s, where 1 means the similarity is above some
threshold.

Generating this matrix by a naive method would require something like 20 billion compound-compound comparisons, which will be a long job. Is there a strategy to make this calculation
manageable?

Thanks,

Mike

ChemAxon efa1591b5a

11-10-2010 07:39:06

Hi Mike,

Tough but very challenging. We'll do a quick brainstorming today and then we'll get back to the forum and share our thoughts with you.

Regards,

Miklos

ChemAxon efa1591b5a

13-10-2010 11:19:45

Dear Mike,

Some ideas...

1, Sort the structures by size, and compute the similarity ratio of those that have similar number of atoms or bonds.

The idea behind this approach is that structures of different size are less likely to exhibit high similarity. (Well, if the smaller one is a substructure of the larger one then this is not necessarily true, it depends on the notion of similarity...)

From the similarity matrix point of view it means that rows and columns are associated with the sorted structures, thus only the elements of the subdiagonals are computed. Not only the subdiagonal next to the diagonal, but some further ones within a given distance, lets say 100-1000 is probably enough (let's denote this by k). Since only the upper (or lower triangle has to be computed (well, if your similarity function is symmetrical) and without the diagonal, this means that instead of n * ( n - 1 ) / 2 you only need k * ( 2 * n - k - 1 ) / 2 elements. And as k <<< n (much smaller) thus instead of O( n^2 ) comparisons (and storage space) O(n) computation and O(n) storage is enough. Well, sorting may take O(n * log n) steps, but for a 250K input it's controllable.

2, Compute the k nearest neighbours of each input structure and calculate the similarity for the nearest neighbours only.

This may sound contradicting since to compute the k-nearest neighbours may require at least O(n ^ 2) computation. However, we developed an approximate algorithm that uses locally sensitive hashing functions to compute the nearest neighbours in O(n) time. This is not yet in production but we make it available for you somehow.

3, Use fast clustering to group your input set and calculate the similarities between elements of the same clusters only.

For example the sphere exclusion algorithm (available in JKlustor) can perform a crude clustering in O(n) time. The cluster size, however, can only be upper bounded by O(n), thus to compute the similarities within the clusters can take longer time, and it is hard to control the size of individual clusters. Yet, it is worth giving a try to this approach as well.

What is your requirement? Get some ideas form us and you work with your team, or we can collaborate on this problem, or you expect us to provide full solution?

Kind regards,

Miklos

User f28a846261

14-10-2010 21:19:06

Thanks Miklos.  Let me share your ideas with my team and get their thoughts.

Best,

MIke

User c31567e5e3

25-10-2010 04:13:22

Another approach is to exhaustively fragment your library (e.g., something along this line) and only do pairwise calculations within members of each unique fragments.  The nice thing here is that the fragments can serve as loose lower bounds to your similarity calculations.  This approach is also easily parallelizable.  We've done something like this on a larger library (though instead of similarity we calculated pairwise MCS).

Good luck!

User f28a846261

05-11-2010 00:45:47

Tiqing and I did some testing of MW vs similarity, and we find that a large difference in MW is not sufficiently predictive of a low similarity to be useful here.  I find some compound pairs with similarity >0.8 but with a MW difference of 300 or more.  This occurs when the larger molecule has a decent-sized piece that is a close match to the smaller molecule.  So these similarities seem reasonable (both molecules might well bind the same protein target), and the weak relationship with MW does not seem to be an error.

To enlarge a bit, if I discard all pairs of compounds with a MW difference greater than 300 or so, this eliminates only about 5% of pairs. So there isn't much computational benefit to it.

Plan B?

best

Mike

ChemAxon efa1591b5a

17-11-2010 10:58:26

Hi Mike,

Interesting results, thanks for sharing them with us.

Did you consider the 2nd and 3rd option? Those do not rely on this false assumption.

Regarding this 1st approach: perhaps count the number of 1 bits in the fingerprints and sort your structures by brightness. Those, that have similar number of 1 bits may be more likely to exhibit higher similarity than those that have very different 1 bit counts. Do you think that this is a viable assumption?

Again, calculate the similarity for some sub diagonals in the vicinity of the diagonal.

What do you think?

BR

Miklos

User f28a846261

18-11-2010 02:25:40

Hi Miklos,

Thanks for your further thoughts on this.  We'll definitely consider your suggestions. However, I think that Tiqing is already partway through the brute-force approach; I think he has done 1/8 of the whole matrix so far.  Although this uses a lot of computer time, it may end up using less human time, which is for us more valuable.  We'll see how this goes.

Best,

Mike

ChemAxon efa1591b5a

18-11-2010 14:55:58

Hi Mike,

That's interesting, it'd be good to learn running times and storage usage.

But what happens if you need to recalculate all similarity scores for some reasons, like change of descriptor or metric being used?

Best regards

Miklos

User c7b783981c

21-09-2011 14:37:54

Hello everyone. As I am bit new to Chemaxon tools..

Can anybody tell me please how to make a similarity matrix for some 60000 compounds with their smiles on 1st row and 1st column of the matrix.

well it is possible using JChem for Excel but I am a linux user .... any Chemaxon tool to do that.....

please send the name of the tool if possible detail explanation how to do that with that tool.

please suggest something it is urgent!!!

ChemAxon efa1591b5a

26-09-2011 01:25:24

Hi,

There's no off-the-shelf solution for this particular problem in the JChem suite.

You either (1) have to write your own java program (fairly simple if you have some java experience, though complexity also depends on where do you store your structures, in a structure file or in a database),

or alternatively you can (2) use the compr program which however does not store the entire similarity matrix only the most similar ones per each row.

If you chose the first option the attached sample code can be helpful.

Regards

Miklos

User c7b783981c

26-09-2011 17:18:04

thank a lot for your quick help !!!

Actually the problem I have given takes much larger time in JChem for excel.....

Also I am not so much familiar with java rather I am with python....

So can I integrate this java code in my python code.....I mean to say that will it call the getTanimoto of ChemAxon plugin???

User c7b783981c

27-09-2011 17:27:26

 tanmay3 wrote: thank a lot for your quick help !!! Actually the problem I have given takes much larger time in JChem for excel..... Also I am not so much familiar with java rather I am with python.... So can I integrate this java code in my python code.....I mean to say that will it call the getTanimoto of ChemAxon plugin???

ANOTHER HELP!!

i AM TRYING TO RUN THE JAVA CODE

javac TanimotoDissimilarity.java

but IT IS SHOWING SOME ERRORS:

TanimotoDissimilarity.java:18: package chemaxon.descriptors does not exist
import chemaxon.descriptors.*;
^
TanimotoDissimilarity.java:19: package chemaxon.struc does not exist
import chemaxon.struc.Molecule;
^
TanimotoDissimilarity.java:20: package chemaxon.formats does not exist
import chemaxon.formats.MolImporter;
^
TanimotoDissimilarity.java:36: cannot find symbol
symbol  : class Molecule
location: class TanimotoDissimilarity
Molecule mol1 = MolImporter.importMol( args[ 0 ] );
^
TanimotoDissimilarity.java:36: cannot find symbol
symbol  : variable MolImporter
location: class TanimotoDissimilarity
Molecule mol1 = MolImporter.importMol( args[ 0 ] );
^
TanimotoDissimilarity.java:37: cannot find symbol
symbol  : class Molecule
location: class TanimotoDissimilarity
Molecule mol2 = MolImporter.importMol( args[ 1 ] );
^
TanimotoDissimilarity.java:37: cannot find symbol
symbol  : variable MolImporter
location: class TanimotoDissimilarity
Molecule mol2 = MolImporter.importMol( args[ 1 ] );
^
TanimotoDissimilarity.java:40: cannot find symbol
symbol  : class CFParameters
location: class TanimotoDissimilarity
CFParameters pars = new CFParameters();
^
TanimotoDissimilarity.java:40: cannot find symbol
symbol  : class CFParameters
location: class TanimotoDissimilarity
CFParameters pars = new CFParameters();
^
TanimotoDissimilarity.java:43: cannot find symbol
symbol  : class ChemicalFingerprint
location: class TanimotoDissimilarity
ChemicalFingerprint cfp1 = new ChemicalFingerprint(pars);
^
TanimotoDissimilarity.java:43: cannot find symbol
symbol  : class ChemicalFingerprint
location: class TanimotoDissimilarity
ChemicalFingerprint cfp1 = new ChemicalFingerprint(pars);
^
TanimotoDissimilarity.java:44: cannot find symbol
symbol  : class ChemicalFingerprint
location: class TanimotoDissimilarity
ChemicalFingerprint cfp2 = new ChemicalFingerprint(pars);
^
TanimotoDissimilarity.java:44: cannot find symbol
symbol  : class ChemicalFingerprint
location: class TanimotoDissimilarity
ChemicalFingerprint cfp2 = new ChemicalFingerprint(pars);
^
13 errors

IS THERE ANYTHING ELSE i AM MISSING TO INSTALL IN Jchem....??

ChemAxon efa1591b5a

27-09-2011 17:34:16

Hi,

you need to set the CLASSPATH to the jchem.jar file.

Miklos

User c7b783981c

27-09-2011 18:30:42

thanks a lot it is working now....

ChemAxon efa1591b5a

29-09-2011 13:26:00

Great, thanks!

User c7b783981c

12-10-2011 09:18:28

problem...slow code

hello sir,

I am sorry to disturb you again...

but when I tried the code on a smiles format file .(reading each smiles of the fie in a python code) and running the java code you have attached (every times i read a smiles) by making it a subprocess of the python code but it is taking very long time .

I have run my python code (yesterday afternoon) on a smiles format file which contains only 700 smiles but it is running still running. (I have to run it on 60000 or more)

I have thought if my code is slower and individually checked them by time but actually when (for testing) I tried your code on terminal directly using two smiles , it is taking at least two seconds.

I have tried another chemaxon tool "evaluate" but anyhow I am not able call the tool from python.

(I have attached the code).

ChemAxon efa1591b5a

13-10-2011 20:50:31

Hi,

I'm afraid we cannot fix your program, our support does not include this kind of assistance.

Note, that the calculation a the full similarity matrix is a computationally intensive task since it scales quadratically. That means that for large input it can take long time to complete the calculation. I refer to the first few posts of this forum thread that elaborate on this issue.

Also bear in mind that the sample code attached to a previous post is just an example and it is not intended for "production". You may need to work on it to improve performance.

Regards

Miklos

User c7b783981c

14-10-2011 09:54:29

Anyways thanks!!!!

User 458270e037

17-10-2011 03:04:15

Hi folks,

I guess our SketchSort algorithm can solve the problem of calculating a large similarity matrix in terms of Tanimoto similarity.

Check out our paper

Y. Tabei and K. Tsuda. SketchSort: Fast All Pairs Similarity Search for Large Databases of Molecular Fingerprints. Molecular Informatics, 30(9):801-807, 2011.

and source code

Thanks!

Koji

User 677b9c22ff

15-12-2011 23:11:23

 tanmay3 wrote: problem...slow code hello sir, I am sorry to disturb you again... but when I tried the code on a smiles format file .(reading each smiles of the fie in a python code) and running the java code you have attached (every times i read a smiles) by making it a subprocess of the python code but it is taking very long time . I have run my python code (yesterday afternoon) on a smiles format file which contains only 700 smiles but it is running still running. (I have to run it on 60000 or more) please help.. (I have attached the code).

Tanmay(?),

your code suffers from serious I/O and file system overhead, you are calling an external function (penalty), you are calling a DB (overhead, btw you also posted your PWD) and then your implementation also has some issues:

1) parallelization is missing - split and distribute (for pairwise calculations)

2) its faster to calculate and load all the fingerprints into memory and do the popcount or FC in memory.

60,000 molecules can be done very easily, a 5 year machine can calculate 1,000,000 tanimotos per second and

that said you can 60k mols in an hour with old hardware and no parallelization only.

I just calculated 10k molecules in 5 minutes with the ChemAxon pairwise Tanimoto fingerprints.

That is not optimal, but also not slow. So the bottleneck is within your implementation.

Cheers

Tobias

User c7b783981c

16-12-2011 10:11:02

WELL YOU HAVE TAKEN TOO LONG TO ANSWER

BUT I HAVE FIXED THIS PROBLEM TILL OCTOBER END ITSELF,

AND IT IS RUNNING  QUITE WELL FOR 100000 COMPOUNDS.

STILL THANK YOU FOR YOUR EFFORT.

ChemAxon efa1591b5a

16-12-2011 13:48:06

Tobias, many thanks for giving great support.

Miklos