Tanimoto similarity between two compounds

User 8052c55234

14-07-2014 15:57:14

I am a new JChem and Marvin user, and I appreciate the software a lot.

I am currently computing the Tanimoto dissimilarity both with the API and from the command line.

I obtain the same fingerprint with the API and the command line, but the Tanimoto value calculated with the API (Ecfp1.getTanimoto(ecfp2) = 0.8898305) is completely different from the one calculated with the command line:


compr -f 1024 -t 1.0 -r -g -z -L -i fingerprints.txt fingerprints.txt -o data.txt

(Here I take the Tanimoto value to be the reported "Maximum dissimilarity between sets", 0.60294116, because there are only 2 compounds.)


 


I have attached the SDF file for testing, together with the data.txt obtained from the command line.

Any help, please?
I also checked the Tanimoto value by hand from the binary fingerprints, and I obtained the same value as with the API. What am I doing wrong with the command line?
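
For reference, a manual check on binary fingerprints can be sketched as follows. This is only a minimal sketch and not JChem code: the class TanimotoCheck and its methods are hypothetical names, and it assumes each fingerprint is available as a string of '0' and '1' characters. It computes the Tanimoto coefficient as (bits set in both) / (bits set in either) and reports the dissimilarity as 1 minus that value.

```java
import java.util.BitSet;

// Hypothetical helper for checking Tanimoto values by hand; not part of JChem.
public class TanimotoCheck {

    // Parse a string of '0'/'1' characters into a BitSet (character i -> bit i).
    public static BitSet parseBits(String bits) {
        BitSet set = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++) {
            char c = bits.charAt(i);
            if (c == '1') {
                set.set(i);
            } else if (c != '0') {
                throw new IllegalArgumentException("Not a binary digit: " + c);
            }
        }
        return set;
    }

    // Tanimoto coefficient: |A and B| / |A or B|.
    public static double similarity(BitSet a, BitSet b) {
        BitSet common = (BitSet) a.clone();
        common.and(b);
        BitSet either = (BitSet) a.clone();
        either.or(b);
        int union = either.cardinality();
        // Assumed convention: two empty fingerprints count as identical.
        if (union == 0) {
            return 1.0;
        }
        return (double) common.cardinality() / union;
    }

    // Tanimoto dissimilarity: 1 - similarity.
    public static double dissimilarity(BitSet a, BitSet b) {
        return 1.0 - similarity(a, b);
    }

    public static void main(String[] args) {
        // Toy 8-bit fingerprints, for illustration only.
        BitSet a = parseBits("11010010");
        BitSet b = parseBits("10011010");
        System.out.println("Tanimoto similarity:    " + similarity(a, b));
        System.out.println("Tanimoto dissimilarity: " + dissimilarity(a, b));
    }
}
```

With real data, the two strings would be the binary fingerprint lines of the two compounds; the result can then be compared against both the API value and the command-line value.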


 


Here is how I do the Tanimoto computation with the API:

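import chemaxon.descriptors.ECFP;
import chemaxon.descriptors.ECFPParameters;
import chemaxon.formats.MolImporter;
import chemaxon.struc.Molecule;

// Descriptor parameters: 1024-bit fingerprint, diameter 8, no counts
ECFPParameters ecfpparam = new ECFPParameters();
ecfpparam.setLength(1024);
ecfpparam.setDiameter(8);
ecfpparam.setKeepCounts(false);

// Fingerprint of the previous molecule (still empty before the first molecule)
ECFP fpSave = new ECFP(ecfpparam);

MolImporter mi = new MolImporter(filename);
Molecule m = mi.read();
while (m != null) {
    // Generate the fingerprint of the current molecule into a fresh ECFP object
    ECFP fp = new ECFP(ecfpparam);
    fp.generate(m);

    // Compare the current fingerprint with the previous one
    System.out.println("Tanimoto: " + fp.getTanimoto(fpSave));

    fpSave = fp;
    m = mi.read();  // advance to the next molecule
}
mi.close();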


The second value printed is the Tanimoto number of interest:


Tanimoto: 0.8898305


This agrees with the value computed from the binary fingerprint.

User 8052c55234

15-07-2014 07:57:43

I found it....


generatemd, when used with ECFP, writes a decimal format that is wrong for compr, jarp, etc.

This then leads to wrong Tanimoto dissimilarity results from the command line.

Clearly, this kind of information should appear in the manual. In my opinion, a complementary file, FingerprintConverter.java (attached to this message and extracted from a hard-to-find topic from 2011), should be added to the JChem library with a short explanation of why it is needed.


https://www.chemaxon.com/forum/ftopic1472.html&sid=6b97639ee572ae0ac11ae2bd6bc2e6a8


 


To avoid any error when the command line is used, compile FingerprintConverter.java to obtain a runnable class:


javac FingerprintConverter.java



Then, to obtain correct Tanimoto values with ECFP from the command line, generate binary fingerprint files with generatemd, convert them, and run compr on the converted file:


generatemd c file.sdf -k ECFP -2 -o BinaryFingerprints.txt


java FingerprintConverter BinaryFingerprints.txt correctDecimalFingerprint.txt


compr -f 1024 -t 0.1 -r -g -z -L -w -i correctDecimalFingerprint.txt  correctDecimalFingerprint.txt -o resultsData.txt
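
The attached FingerprintConverter.java is not reproduced here. As a rough illustration only, the sketch below shows one way such a conversion could work, under assumptions that are not confirmed anywhere in this thread: that each input line is a string of '0'/'1' characters, that the tools read a fingerprint as 32-bit integers written in decimal, and that the first character of each 32-character group is the most significant bit. The class name BinaryToDecimalSketch and the tab separator are hypothetical as well, so check the real converter before relying on any of this.

```java
// Hypothetical sketch of a binary-string to decimal-word conversion.
// This is NOT the FingerprintConverter.java from the forum topic.
public class BinaryToDecimalSketch {

    // Pack a '0'/'1' string into 32-bit words.
    // Assumption: the first character of each group is the most significant bit.
    public static int[] pack(String bits) {
        if (bits.length() % 32 != 0) {
            throw new IllegalArgumentException("Length must be a multiple of 32");
        }
        int[] words = new int[bits.length() / 32];
        for (int w = 0; w < words.length; w++) {
            int value = 0;
            for (int i = 0; i < 32; i++) {
                char c = bits.charAt(w * 32 + i);
                if (c != '0' && c != '1') {
                    throw new IllegalArgumentException("Not a binary digit: " + c);
                }
                value = (value << 1) | (c - '0');
            }
            words[w] = value;
        }
        return words;
    }

    // Write the packed words as one line of signed decimal integers.
    // Assumption: tab-separated output.
    public static String toDecimalLine(String bits) {
        int[] words = pack(bits);
        StringBuilder sb = new StringBuilder();
        for (int w = 0; w < words.length; w++) {
            if (w > 0) {
                sb.append('\t');
            }
            sb.append(words[w]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Toy 64-bit fingerprint: only the last bit of each 32-bit group is set.
        StringBuilder bits = new StringBuilder();
        for (int g = 0; g < 2; g++) {
            for (int i = 0; i < 31; i++) {
                bits.append('0');
            }
            bits.append('1');
        }
        System.out.println(toDecimalLine(bits.toString()));
    }
}
```

A 1024-bit fingerprint would yield 32 integers per line under these assumptions, which is the kind of line the -f 1024 option of compr would then be applied to.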


ChemAxon 8b644e6bf4

18-07-2014 15:41:13

Sorry for missing your original post. I agree that the default decimal representation of the ECFP descriptor family (a list of feature identifiers) might cause confusion, since one could expect a packed binary string representation. In the new descriptors API (under construction), we expose the folded binary representation by default.
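
To make the difference between the two representations concrete: a list of integer feature identifiers can be folded into a fixed-length bit string by mapping each identifier to a bit position. The sketch below uses a plain modulo mapping, which is an assumption chosen for illustration and not necessarily the folding scheme JChem uses; the class name FoldSketch and the identifier values are hypothetical.

```java
import java.util.BitSet;

// Hypothetical illustration of folding feature identifiers into a bit string.
// The modulo mapping is an assumption, not JChem's documented scheme.
public class FoldSketch {

    // Map each identifier to a position in [0, length) and set that bit.
    public static BitSet fold(int[] featureIds, int length) {
        if (length <= 0) {
            throw new IllegalArgumentException("Length must be positive");
        }
        BitSet bits = new BitSet(length);
        for (int id : featureIds) {
            // floorMod keeps the position non-negative for negative identifiers
            bits.set(Math.floorMod(id, length));
        }
        return bits;
    }

    // Render the folded fingerprint as a '0'/'1' string of the given length.
    public static String toBinaryString(BitSet bits, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(bits.get(i) ? '1' : '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up identifiers, folded into a toy 16-bit fingerprint.
        int[] featureIds = {-1, 5, 21, 16};
        BitSet folded = fold(featureIds, 16);
        System.out.println(toBinaryString(folded, 16));
    }
}
```

A tool that reads the raw identifiers as if they were already packed fingerprint words would be comparing different bit patterns than the folded ones, which would be consistent with the discrepancy reported above.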


regards,


Gabor