Clustering 1-2 million compounds

User 1d259ba1ce

16-05-2008 14:48:43

Hi all,


I have 2 huge databases. What's the best method to cluster them?


So far I've calculated the CF (chemical fingerprint) descriptors of the SDF files with this command:


> generatemd c input.sdf -k CF -c /usr/local/jchem/examples/config/cfp.xml -D -v -o fingerprints.txt


Now I'm running Ward (with HEAP_LIMIT raised to 1024) with this command:


> ward -f 512 -g -K kelley.txt < fingerprints.txt > neighborlists.txt


Is Kelley OK to determine the best level of clustering? Is it a feasible calculation (not too time-consuming... still running after 300 minutes on a 2.8 GHz Xeon)?


Is there a better procedure for clustering such huge databases?


Many thanks


Andrea

ChemAxon efa1591b5a

20-05-2008 09:40:37

Hi Andrea,


If your database is really huge, then neither Ward nor Jarvis-Patrick is suitable for your needs. Ward certainly isn't: it is a hierarchical method intended to cluster small data sets (i.e. <30K structures) with high accuracy.
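

To illustrate the scaling problem, here is a minimal sketch in Python using SciPy's generic Ward implementation (not our ward tool; the random bit vectors just stand in for real CF fingerprints). Hierarchical methods consume all pairwise distances, and that matrix grows quadratically with the input:

# Why Ward-type hierarchical clustering does not scale: it needs the
# full condensed distance matrix, which grows quadratically with n.
# SciPy's generic implementation; random bits stand in for fingerprints.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import ward, fcluster

n, bits = 5000, 512
fps = np.random.randint(0, 2, size=(n, bits)).astype(bool)

# n*(n-1)/2 pairwise distances: ~12.5M floats here, ~450M (3.6 GB)
# at 30K structures, ~5 billion at 100K -- hopeless for millions.
dist = pdist(fps, metric="jaccard")  # Jaccard distance = 1 - Tanimoto

linkage = ward(dist)                 # the quadratic hierarchical step
labels = fcluster(linkage, t=50, criterion="maxclust")  # cut into 50 clusters
print(labels[:10])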


You may give Jarvis-Patrick a try; it scales nearly linearly with the number of input structures. On my Mac (Core Duo, 2.16 GHz) it clusters 100K structures in about 80 seconds.
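

To show the idea behind it, here is a rough pure-Python sketch of the Jarvis-Patrick rule: two compounds are merged when each is among the other's K nearest neighbours and they share at least J of those neighbours. The K/J values and the brute-force neighbour search are my own simplifications for readability, not the actual JKlustor implementation:

# Jarvis-Patrick sketch: merge i and j if they are mutual K-nearest
# neighbours sharing at least J common neighbours. The neighbour search
# here is brute-force O(n^2); production tools use screening so that
# overall behaviour stays close to linear in practice.
import numpy as np

def tanimoto(a, b):
    both = np.logical_and(a, b).sum()
    either = np.logical_or(a, b).sum()
    return both / either if either else 1.0

def jarvis_patrick(fps, K=14, J=8):
    n = len(fps)
    # K nearest neighbours of each compound by Tanimoto similarity.
    nbrs = []
    for i in range(n):
        sims = np.array([tanimoto(fps[i], fps[j]) for j in range(n)])
        nbrs.append(set(np.argsort(-sims)[:K]))
    # Union-find over the "mutual neighbours sharing >= J" relation.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if i in nbrs[j] and j in nbrs[i] and len(nbrs[i] & nbrs[j]) >= J:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

fps = np.random.randint(0, 2, size=(200, 512)).astype(bool)
print(jarvis_patrick(fps)[:20])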


If you have over 1M structures, sphere exclusion is a viable choice, though JKlustor does not implement that particular clustering algorithm.
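

Sphere exclusion itself is simple to sketch; here is a leader-style pass in Python (the 0.65 Tanimoto threshold and the helper function are only illustrative, this is not a ChemAxon feature). Each compound is compared against the current centroids only, so one pass can handle millions of structures:

# Sphere exclusion (leader-style): a compound joins the first centroid
# it is similar enough to; otherwise it founds a new cluster. One pass,
# memory proportional to the number of clusters, not to n^2.
import numpy as np

def tanimoto(a, b):
    both = np.logical_and(a, b).sum()
    either = np.logical_or(a, b).sum()
    return both / either if either else 1.0

def sphere_exclusion(fps, threshold=0.65):
    centroids, labels = [], []
    for fp in fps:
        for ci, c in enumerate(centroids):
            if tanimoto(fp, c) >= threshold:
                labels.append(ci)   # inside an existing sphere
                break
        else:
            centroids.append(fp)    # new sphere centred on this compound
            labels.append(len(centroids) - 1)
    return labels

fps = np.random.randint(0, 2, size=(1000, 512)).astype(bool)
labels = sphere_exclusion(fps)
print(len(set(labels)), "clusters")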


HTH,


regards


Miklos