Dear Mike,

Some ideas...

1, Sort the structures by size, and compute the similarity ratio of those that have similar number of atoms or bonds.

The idea behind this approach is that structures of different size are less likely to exhibit high similarity. (Well, if the smaller one is a substructure of the larger one then this is not necessarily true, it depends on the notion of similarity...)

From the similarity matrix point of view it means that rows and columns are associated with the sorted structures, thus only the elements of the subdiagonals are computed. Not only the subdiagonal next to the diagonal, but some further ones within a given distance, lets say 100-1000 is probably enough (let's denote this by *k*). Since only the upper (or lower triangle has to be computed (well, if your similarity function is symmetrical) and without the diagonal, this means that instead of *n * ( n - 1 ) / 2* you only need *k * ( 2 * n - k - 1 ) / 2* elements. And as k <<< n (much smaller) thus instead of O( n^2 ) comparisons (and storage space) O(n) computation and O(n) storage is enough. Well, sorting may take O(n * log n) steps, but for a 250K input it's controllable.

2, Compute the *k* nearest neighbours of each input structure and calculate the similarity for the nearest neighbours only.

This may sound contradicting since to compute the *k*-nearest neighbours may require at least O(n ^ 2) computation. However, we developed an approximate algorithm that uses locally sensitive hashing functions to compute the nearest neighbours in O(n) time. This is not yet in production but we make it available for you somehow.

3, Use fast clustering to group your input set and calculate the similarities between elements of the same clusters only.

For example the sphere exclusion algorithm (available in JKlustor) can perform a crude clustering in O(n) time. The cluster size, however, can only be upper bounded by O(n), thus to compute the similarities within the clusters can take longer time, and it is hard to control the size of individual clusters. Yet, it is worth giving a try to this approach as well.

What is your requirement? Get some ideas form us and you work with your team, or we can collaborate on this problem, or you expect us to provide full solution?

Kind regards,

Miklos