'sphere exclusion' clustering - the speed of performance

User 91f8768a43

09-06-2011 14:58:02



Could you please advise me on some clustering performance issues?

I need to cluster a number of compound libraries using ECFP fingerprints. As I understand it, 'sphere exclusion' clustering is the most suitable and fastest method in JKlustor for this purpose. I started the calculation for a set of 300k compounds about twenty-four hours ago and it is still running. Should it take this long? If so, how long would it take to cluster a library of 1-2 million compounds? Is there any way to speed up the calculation?



System specifications:

Windows XP x64 Edition

Intel(R) Core(TM) i5 CPU 760 @ 2.80 GHz

RAM: 4 GB



Command line:

jklustor -c sphex:0.2 -d ecfp carb.smi -o "wrclus:smiles:carb_clus.txt:descs"







P.S. As can be seen in Task Manager, the process uses 32-bit Java (1 processor and up to 500 MB of memory), although 64-bit Java is present on the system and works perfectly with InstantJChem.

User 91f8768a43

20-06-2011 21:10:32

Is there anybody who can help me with this question? Has anybody run JKlustor calculations on thousands of compounds?

ChemAxon 8b644e6bf4

20-06-2011 21:34:39

Sorry for the late answer. Consider using -v to turn on verbose mode, which prints progress messages during the clustering process. In the case of sphere exclusion (and also Bemis-Murcko) clustering, the clustering is done while the input is being read.


Increasing the dissimilarity threshold (0.2 in your case) will decrease the cluster count and speed up the clustering process.
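To illustrate why a larger radius can help, here is a minimal Python sketch of a generic single-pass, leader-style sphere exclusion. It is not JKlustor's actual implementation: the function names, the representation of fingerprints as sets of on-bit indices, and the toy data are all assumptions made for illustration. The point is only that every new structure is compared against the existing centroids, so fewer clusters mean fewer comparisons.

```python
def tanimoto_dissimilarity(a, b):
    """1 - |a & b| / |a | b| for fingerprints stored as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def sphere_exclusion(fingerprints, radius):
    """Hypothetical leader-style sphere exclusion (illustration only).

    Each fingerprint joins the first centroid lying within `radius`;
    if there is none, it becomes a new centroid itself.
    Returns (centroids, assignments, number_of_comparisons).
    """
    centroids = []
    assignments = []
    comparisons = 0
    for fp in fingerprints:
        for idx, centre in enumerate(centroids):
            comparisons += 1
            if tanimoto_dissimilarity(fp, centre) <= radius:
                assignments.append(idx)
                break
        else:
            # No centroid was close enough: open a new cluster.
            centroids.append(fp)
            assignments.append(len(centroids) - 1)
    return centroids, assignments, comparisons


# Toy fingerprints (made-up on-bit indices, not real ECFP output).
toy = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {1, 2, 6, 7},
    {8, 9, 10, 11},
    {8, 9, 10, 12},
    {20, 21, 22, 23},
]

for radius in (0.2, 0.7):
    centroids, _, comparisons = sphere_exclusion(toy, radius)
    print(radius, len(centroids), comparisons)
# prints "0.2 6 15" and then "0.7 3 7"
```

With the tight radius every toy structure becomes its own centroid, so the number of comparisons grows roughly quadratically with the input size; with the looser radius the same data needs about half as many comparisons.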


Regards,


Gabor

User 68d678d290

25-08-2011 14:36:17

Dear Gabor and Bakhtiyor,


 


I also recently tried JKlustor sphere exclusion clustering with the "-d ecfp" option on a 250k set.


It ran for more than 3 days and then failed, but with the default descriptor (CF fingerprints, I presume) it gave me results within 20 minutes.


Is it possible that we are facing some intrinsic problem with ECFP here, similar to the issues with "compr"? See my post: https://www.chemaxon.com/forum/viewpost36439.html#36439


 


Thank you very much for your suggestions,


 


Lex

User 68d678d290

26-08-2011 14:01:20

I have to apologize: there was obviously some system failure when I used sphex with the "-d ecfp:tanimoto" option.


I recently tried it once more and it worked just fine, although much more slowly than with the default CFP.


By the way, the default (CFP, as I understand it) produced clustering results that were more meaningful to me for compounds with fused heterocyclic systems. So I am curious which parameters JKlustor uses for both types of fingerprints:


bond depth, bit length, number of bits?


Is it possible to adjust these parameters for JKlustor in some XML configuration file?


 


Thank you very much in advance for your suggestions,


 


Lex


 

ChemAxon 8b644e6bf4

30-08-2011 02:28:56

Dear Lex,


 


It seems that ECFP tends to give higher dissimilarity values than CFP. Currently the sphere exclusion dissimilarity radius parameter has to be adjusted depending on the fingerprint type used.


The default fingerprint parameters are used and they cannot be modified, but this is a planned feature.


A brief introduction to sphere exclusion parameter tuning will be linked here in the next few days.


Regards,


Gabor

User 68d678d290

30-08-2011 13:53:05

Dear Gabor,


 


Thank you very much for your prompt response.


If possible, would you mind telling me the default parameters for both fingerprints bundled with JKlustor?


It might be helpful for publishing data as well as to figure out the range for the sphere radius adjustment.


 


Thank you very much in advance,


 


Lex

ChemAxon 8b644e6bf4

02-09-2011 03:18:26

Dear Lex,


Sorry for the late answer.


CFP config XML (as far as I know, also available in examples/config of a locally installed JChem):


<ChemicalFingerprintConfiguration Version="0.3" schemaLocation="cfp.xsd">
    <Parameters Length="1024" BondCount="7" BitCount="2"/>
    <StandardizerConfiguration Version="0.1">
        <Actions>
            <Action ID="aromatize" Act="aromatize"/>
        </Actions>
    </StandardizerConfiguration>
    <ScreeningConfiguration>
        <ParametrizedMetrics>
            <ParametrizedMetric Name="Tanimoto" ActiveFamily="Generic"
                Metric="Tanimoto" Threshold="0.2"/>
            <ParametrizedMetric Name="Euclidean" ActiveFamily="Generic"
                Metric="Euclidean" Threshold="15"/>
        </ParametrizedMetrics>
    </ScreeningConfiguration>
</ChemicalFingerprintConfiguration>


 


ECFP config XML:


<ECFPConfiguration Version="0.1">
    <Parameters Length="1024" Diameter="4" Counts="no"/>
    <IdentifierConfiguration>
        <!-- Default atom properties (switched on by Value=1) -->
        <Property Name="AtomicNumber" Value="1"/>
        <Property Name="HeavyNeighborCount" Value="1"/>
        <Property Name="HCount" Value="1"/>
        <Property Name="FormalCharge" Value="1"/>
        <Property Name="IsRingAtom" Value="1"/>
        <!-- Other built-in atom properties (switched off by Value=0) -->
        <Property Name="ConnectionCount" Value="0"/>
        <Property Name="Valence" Value="0"/>
        <Property Name="Mass" Value="0"/>
        <Property Name="MassNumber" Value="0"/>
        <Property Name="HasAromaticBond" Value="0"/>
        <Property Name="IsTerminalAtom" Value="0"/>
        <Property Name="IsStereoAtom" Value="0"/>
    </IdentifierConfiguration>
    <StandardizerConfiguration Version="0.1">
        <Actions>
            <Action ID="aromatize" Act="aromatize"/>
            <RemoveExplicitH ID="RemoveExplicitH" Groups="target"/>
        </Actions>
    </StandardizerConfiguration>
    <ScreeningConfiguration>
        <ParametrizedMetrics>
            <ParametrizedMetric Name="Tanimoto" ActiveFamily="Generic"
                Metric="Tanimoto" Threshold="0.2"/>
            <ParametrizedMetric Name="Euclidean" ActiveFamily="Generic"
                Metric="Euclidean" Threshold="10"/>
        </ParametrizedMetrics>
    </ScreeningConfiguration>
</ECFPConfiguration>


It might be helpful for publishing data as well as to figure out the range for the sphere radius adjustment.


Internally JKlustor uses a 0 .. 1 dissimilarity range, usually with 0 for the most similar (identical fingerprints) and 1 for the most dissimilar possible values (*). Generally, a useful approach is to start sphere exclusion clustering from a high dissimilarity radius (even 0.9) and to decrease it step by step while checking the cluster sizes (either by following the import with the "-v" option, which turns on verbose mode, or by using the "-o wrstat" option to obtain statistics).
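The radius sweep described above could be scripted along the following lines. This is only a sketch: the helper name `sweep_commands` and the radius list are made up for illustration, the options are the ones quoted in this thread ("-c sphex:<radius>", "-d ecfp", "-v", "-o wrstat"), and the exact form of "-o wrstat" as well as the placement of "-v" are assumptions that should be checked against the JKlustor help output.

```python
def sweep_commands(input_file, descriptor="ecfp",
                   radii=(0.9, 0.8, 0.7, 0.6, 0.5)):
    """Build one jklustor argument list per radius, highest radius first.

    Hypothetical helper: it only assembles command lines from the options
    mentioned in this thread and does not run anything.
    """
    commands = []
    for radius in sorted(radii, reverse=True):
        commands.append([
            "jklustor",
            "-v",                      # verbose mode, prints progress
            "-c", f"sphex:{radius}",   # sphere exclusion with this radius
            "-d", descriptor,          # fingerprint type
            input_file,
            "-o", "wrstat",            # statistics output (form assumed)
        ])
    return commands


for cmd in sweep_commands("carb.smi"):
    print(" ".join(cmd))
```

Each argument list can then be passed to `subprocess.run(cmd, check=True)`, stopping at the first radius whose cluster sizes look reasonable.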


The JKlustor web GUI provides a "matrix" view for comparing the dissimilarity values of cluster representatives/centroids and individual structures. The binary representation of the fingerprint is also visualized on the individual structure pages.


An online demo is available at http://discoverygroup.chemaxon.com/MGSandbox/jkdemo.jsp (select structures to fetch from a URL or upload, set the clustering parameters and launch jklustor); locally, you can use the jklustor option "-s 88" to launch web server mode after clustering and connect to it by opening http://localhost:88 in a browser.


(*) For LibraryMC(E)S a metric called "commonbits" is implemented, which is calculated by dividing the count of simultaneously set bits by the fingerprint length and subtracting the result from 1.0. This dissimilarity metric will not give 0 for identical structures.
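A small worked example may make the footnote concrete. The sketch below follows one reading of the description above (bits set in both fingerprints, divided by the fingerprint length, subtracted from 1.0) and compares it with a Tanimoto dissimilarity. The function names and the fingerprint with 40 set bits are illustrative assumptions, not ChemAxon code.

```python
def tanimoto_dissimilarity(on_a, on_b):
    """1 - |A & B| / |A | B| on sets of on-bit indices."""
    union = len(on_a | on_b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(on_a & on_b) / union


def commonbits_dissimilarity(on_a, on_b, length):
    """1 - (bits set in both fingerprints) / (fingerprint length)."""
    return 1.0 - len(on_a & on_b) / length


LENGTH = 1024            # fingerprint length quoted in this thread
fp = set(range(40))      # made-up fingerprint with 40 bits set

# The same fingerprint compared with itself:
print(tanimoto_dissimilarity(fp, fp))            # 0.0
print(commonbits_dissimilarity(fp, fp, LENGTH))  # 0.9609375 (1 - 40/1024)
```

Under this reading, "commonbits" reaches 0 only when every bit of both fingerprints is set, which is why identical structures with sparse fingerprints still get a value close to 1.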


 


Regards,


Gabor

User 68d678d290

02-09-2011 13:50:35

Dear Gabor,


 


Thank you very much for the configuration files.


I am sorry for a silly question, but how can I be sure that JKlustor will read a particular configuration file?


In other words, in which directory (is it '\JChem\examples\config'?) should I place these XML files, or how can I specify the path to my custom configuration XML files on the command line?


An additional naive question: if I understand correctly, the parameters given in these configuration files are the default fingerprint parameters whenever the fingerprints are called by any ChemAxon subroutine?


 


Thank you very much for your great support,


 


Lex


 

ChemAxon 8b644e6bf4

06-09-2011 14:31:04

Dear Lex,


 


Sorry for the misunderstanding.


JKlustor uses the default fingerprint parameters, which are hardwired in the code. They cannot be modified in JKlustor; some kind of parameterization is a planned feature for the near future. In 5.7 the verbose mode of JKlustor will be extended to print the main parameters used (length, etc.).


The referenced files contain fingerprint configuration examples (which can be used in other products); the main fingerprint parameters in those files match the hardwired defaults. The configurations in these example files are not (and cannot be) read by JKlustor.


The actual hardwired configuration used in cfp (this modification in the contents of cfp.xml will be corrected in release 5.7):


<?xml version="1.0" encoding="UTF-8"?>

<ChemicalFingerprintConfiguration Version ="0.3" schemaLocation="cfp.xsd">

    <Parameters Length="1024" BondCount="7" BitCount="2"/>

    <ScreeningConfiguration>
        <ParametrizedMetrics>
            <ParametrizedMetric Name="Tanimoto" ActiveFamily="Generic"
                Metric="Tanimoto" Threshold="0.2"/>
            <ParametrizedMetric Name="Euclidean" ActiveFamily="Generic"
                Metric="Euclidean" Threshold="10"/>
            <ParametrizedMetric Name="Tversky" ActiveFamily="Generic"
                Metric="Tversky" Threshold="0.5" TverskyAlpha="1" TverskyBeta="1"/>
        </ParametrizedMetrics>
    </ScreeningConfiguration>

</ChemicalFingerprintConfiguration>


regards,


Gabor

ChemAxon 8b644e6bf4

09-09-2011 02:33:25

Dear Lex,


A brief introduction to sphere exclusion parameter tuning will be linked here in the next days.



This introduction to sphere exclusion clustering and parameter handling is available at https://docs.google.com/document/pub?id=1C5xYiV4Gk_UWSWV2UQ-PhWKPGBkZitnOlKICT2qXHKk .


A tracker topic with notifications about major modifications of this document is available at https://www.chemaxon.com/forum/ftopic8015.html




If you have any further questions please do not hesitate to ask them,


Regards,


Gabor


 


 

User 68d678d290

09-09-2011 13:24:09

Dear Gabor,


 


Thank you very much for your help.


Actually, the most important piece of information for me was that the default parameters correspond to my personal preferences for clustering, i.e.


ECFP Parameters: Length="1024" Diameter="4"; as I understand it, bit counts are not used for ECFP.


Also I am curious, whether FCFP could be invoked on its own.


 


Many thanks,


Lex



ChemAxon 8b644e6bf4

13-09-2011 14:23:28

Dear Lex,


Also I am curious, whether FCFP could be invoked on its own.


Could you please clarify this question?


Regards,


Gabor

User 68d678d290

13-09-2011 16:21:59

Dear Gabor,


 


According to your release note


http://www.chemaxon.com/news/marvin_jchem5-4_launched/


"...Screen – (Fast/robust 2D and now 3D ligand-based virtual screening)


- Extended connectivity fingerprint (ECFP) now available (includes FCFP)


- Available both as hashed binary fingerprint and as a list of integer features..."


I thought that FCFP is also implemented.


Maybe I got it wrong, and FCFP is available only within the Screen module (http://www.chemaxon.com/products/screen/)?


 


Thanks,


Lex


ChemAxon efa1591b5a

13-09-2011 16:40:25

Hi Lex,


 


Indeed, ECFP/FCFP are fully supported in Screen. At present FCFP cannot be used in JKlustor, I'm afraid.


Regards,


Miklos