GenerFP and GenerateMD - ChemAxon Forum Archive

ChemAxon efa1591b5a

19-05-2006 07:15:26

Hi Alex,

indeed, generfp is not supported any more.

However, both generfp and generatemd should generate the same results. In order to get the same output, the format has to be specified for both programs, because their default settings differ.

Also, in case of generfp the fingerprint length is interpreted as number of bytes rather than number of bits.

You may wish to try the commands below:

generfp -fl 128 -pl 7 -bc 3 < nci1000.sdf -fi -s t > fp.txt

generateMD c nci1000.sdf -f 1024 -n 7 -b 3 -k CF -D -o md.txt

Some remarks: the binary output cannot be used, as generatemd uses a hard-wired separator between bytes. Only the decimal (integer) output format is produced by both programs in exactly the same way (tab separated).
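
The two length options in the commands above describe the same fingerprint: generfp's -fl counts bytes, while generatemd's -f counts bits. A minimal sketch of that arithmetic follows; the helper name is hypothetical and not part of either tool.

```python
BITS_PER_BYTE = 8


def generfp_length_to_bits(length_in_bytes: int) -> int:
    """Convert a generfp-style length (bytes) to a generatemd-style length (bits)."""
    if length_in_bytes <= 0:
        raise ValueError("fingerprint length must be positive")
    return length_in_bytes * BITS_PER_BYTE


# generfp -fl 128 corresponds to generateMD -f 1024
print(generfp_length_to_bits(128))  # 1024
```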

Hope this helps,

Miklos

User 68d678d290

27-07-2011 17:43:45

Dear ChemAxon team,

I am trying to perform a self-dissimilarity test for a selected screening set.

I would like to use ECFP-4 fingerprints that were generated either in decimal format

>generatemd c fm_RDS20630.sdf -f 1024 -n 4 -b 2 -k ECFP -D -g -o fpD_RDS20630.txt

or in binary one

>generatemd c fm_RDS20630.sdf -f 1024 -n 4 -b 2 -k ECFP -2 -g -o fpB_RDS20630.txt

When I tried to run compr for decimal

>compr -f 1024 -t 0.1 -g -z -i fpD_RDS20630.txt fpD_RDS20630.txt >div_RDS20630.tab

it says:

Unknown error
java.util.NoSuchElementException
        at java.util.StringTokenizer.nextToken(Unknown Source)
        at chemaxon.clustering.SpaceInputStream.loadOne(SpaceInputStream.java:109)
        at chemaxon.clustering.SpaceInputStream.loadSpace(SpaceInputStream.java:61)
        at chemaxon.clustering.Compare.run(Compare.java:470)
        at chemaxon.clustering.Compare.main(Compare.java:688)

and for binary

>compr -f 1024 -t 0.1 -g -z -i fpB_RDS20630.txt fpB_RDS20630.txt >div_RDS20630.tab

it says:

Error: For input string: "00000000|10000000|00000000 ...

Thank you very much in advance for any help

ChemAxon 4a2fc68cd1

01-08-2011 09:30:20

Hi,

The decimal representation of ECFP is a varying-length list of integers, which is not supported by compr. However, the binary representation should be supported. We could reproduce the bug ("Error: For input string: ...") and are investigating it.
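
Why a varying-length list is a problem for a reader that expects fixed-length rows can be illustrated as follows. This is only a sketch of the general issue, not compr's actual parser; the function name and the sample rows are made up for illustration.

```python
from typing import List


def read_fixed_length_row(line: str, expected_count: int) -> List[int]:
    """Parse a tab-separated row that must hold exactly expected_count integers."""
    tokens = line.strip().split("\t")
    if len(tokens) != expected_count:
        raise ValueError(
            f"expected {expected_count} integers, found {len(tokens)}"
        )
    return [int(token) for token in tokens]


# A row with the expected number of integers parses cleanly.
print(read_fixed_length_row("12\t0\t7\t255", expected_count=4))

# A shorter row (as a varying-length identifier list could be) is rejected.
try:
    read_fixed_length_row("1048576\t-77", expected_count=4)
except ValueError as error:
    print("rejected:", error)
```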

Peter

ChemAxon 4a2fc68cd1

01-08-2011 10:17:31

Hi,

It seems that the current version of compr cannot handle the binary representation of fingerprints (neither ECFP nor other fingerprints). The decimal format of ECFP is not supported either, because it is a varying-length list.

So I'm afraid that you cannot use the current version of compr with ECFP.

Peter

User 68d678d290

01-08-2011 13:36:20

Dear Peter,

Thank you very much for your prompt response.

Is there any chance of having an implementation of ECFP for the dissimilarity test in the foreseeable future?

I have already done some extensive data mining using ECFP and I have to maintain consistency in my data (to keep reviewers calm).

On some test sets I obtained results comparable with ECFP_4 1024 using CF -f 2048 -n 8 -b 4, but it would take quite a bit of work to redo all my selections.

Thank you very much for your assistance,

Lex

ChemAxon efa1591b5a

02-08-2011 11:01:52

Hi Lex,

We will consider providing a simple tool that converts the current binary text output of generatemd into a decimal format handled by compr. This complicates the workflow a bit but enables the use of existing software without the need for fixing and releasing, which takes longer. Would that make sense for you? Would you use this converter tool between the descriptor generation step and the comparison stage?

However, for the longer term, I'm interested in the particular use case/problem you are dealing with. Basically, I would like to understand how well the compr program meets your expectations, or what would be the ideal tool for you in this study.

Regards

Miklos

User 68d678d290

02-08-2011 14:12:46

Dear Miklos,

Thank you very much for your help.

An intermediate conversion step doesn't bother me at all, as long as I am able to use just another form of the same structural representation by ECFP. I initially chose these fingerprints because there are numerous works on their advantages, so I don't need to explain my choice.

As for an ideal solution, I am actually still trying to figure out what it could be. I am trying to find some practical way (with respect to my available resources) to select screening compounds from commercial vendors so as to have diverse and uniformly distributed scaffolds. In my eyes the approach proposed by Shelat, A. A.; Guy, R. K., Scaffold composition and biological relevance of screening libraries. Nat Chem Biol 2007, 3 (8), 442-446, looks most appealing, but resource demanding. I am hoping to mimic their approach in a more affordable way.

I just started to use compr. Generally it gives results comparable (at level CF -f 2048 -n 8 -b 4) with Discovery Studio (I had a chance to run several test sets on a friend's workstation). I really like compr because of the better control (for many routine tasks a GUI is just a waste of time) and the speed of the process, although I have no opportunity (or resources) to try Pipeline Pilot.

I am a newbie in chemoinformatics. In the past I relied on other people for the selection process, but the results didn't satisfy my expectations.

I do apologize if you find my answer too long or misleading, I was trying to be as sincere as possible.

Thank you very much again for your great help.

Lex

ChemAxon 4a2fc68cd1

04-08-2011 14:34:02

Hi Lex,

I attached a simple Java file to convert binary fingerprint files into decimal format. You can use this tool like this:

javac FingerprintConverter.java
java FingerprintConverter <input_file> <output_file>

In particular, your workflow should be modified this way:

generatemd c fm_RDS20630.sdf -f 1024 -n 4 -b 2 -k ECFP -2 -g -o fpB_RDS20630.txt

javac FingerprintConverter.java
java FingerprintConverter fpB_RDS20630.txt fpD_RDS20630.txt

compr -f 1024 -t 0.1 -g -z -i fpD_RDS20630.txt fpD_RDS20630.txt > div_RDS20630.tab

The generatemd and compr lines are unmodified; only the conversion step between them is new.
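
The attached FingerprintConverter.java is not reproduced in this archive. The following is only a rough sketch of what such a conversion step might do, written in Python for brevity. It assumes that each row holds the fingerprint as groups of bits separated by '|' (as in the error message quoted earlier in the thread), that the bits are packed into 32-bit words with the leftmost bit most significant, and that the words are written as tab-separated unsigned integers. The real tool may differ on any of these points (bit order, signedness, handling of ID columns), and all function names here are hypothetical.

```python
def binary_row_to_decimal(row: str, word_bits: int = 32, separator: str = "|") -> str:
    """Pack one row of '0'/'1' characters into tab-separated integers.

    Assumption: leftmost bit of each word is the most significant one.
    """
    bits = row.strip().replace(separator, "")
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError("row must contain only 0, 1 and the separator")
    if len(bits) % word_bits != 0:
        raise ValueError("bit count is not a multiple of the word size")
    words = [
        int(bits[start:start + word_bits], 2)
        for start in range(0, len(bits), word_bits)
    ]
    return "\t".join(str(word) for word in words)


def convert_file(input_path: str, output_path: str) -> None:
    """Convert every non-empty row of a binary fingerprint file."""
    with open(input_path) as source, open(output_path, "w") as target:
        for row in source:
            if row.strip():
                target.write(binary_row_to_decimal(row) + "\n")


# One 32-bit word: bits 8 and 31 (counted from the left) are set.
print(binary_row_to_decimal("00000000|10000000|00000000|00000001"))  # 8388609
```

With the defaults above, a 1024-bit fingerprint would become a row of 32 integers, so every row has the same length regardless of how many bits are set.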

Regards,
Peter

User 68d678d290

04-08-2011 14:41:49

Hello,

One more observation related to compr.

I played with two artificially assembled sets, let's call them A and B, where B = A + 30% of random (A).

I.e. both sets are similar, or better to say identical in the common sense. I also asked for their similarity to be checked in Discovery Studio: it shows 0.99 similarity on ECFP_4.

Whereas compr scored them as dissimilar with 0.76 score.

From the posted description of the general idea of how compr works I can see the source of these discrepancies. Indeed, your approach allows much faster calculation; I just wanted to let you know my observations.

Thanks again for your help,

Lex

ChemAxon 4a2fc68cd1

04-08-2011 19:34:23

Hi Lex,

Thank you for your interesting observation about compr.

Btw. have you seen my previous post? I attached a tool for converting binary ECFP outputs to decimal format, which is handled by the current version of compr.

Best regards,
Peter

User 68d678d290

04-08-2011 20:36:32

Dear Peter,

Thank you very much for the converter, it helps a lot.

Best regards,

Lex

User 68d678d290

05-08-2011 18:29:59

Hello,

I think it would be methodologically much more accurate to calculate the dissimilarity between two different sets as a mean value of "minD".

At least, such an approach gave me results that correlate pretty well with the expected values on my model sets.
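
A minimal sketch of this mean-of-"minD" idea, assuming that minD denotes the distance from a compound in one set to its nearest neighbour in the other set, and assuming Tanimoto dissimilarity on the sets of "on" bit positions. Both are assumptions for illustration (the thread does not state which metric compr applies), and the function names are hypothetical.

```python
from typing import FrozenSet, Sequence

Fingerprint = FrozenSet[int]  # positions of the bits that are set


def tanimoto_dissimilarity(a: Fingerprint, b: Fingerprint) -> float:
    """1 - |a & b| / |a | b|; two empty fingerprints count as identical."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def mean_min_dissimilarity(
    query_set: Sequence[Fingerprint], reference_set: Sequence[Fingerprint]
) -> float:
    """Average, over the query set, of the distance to the nearest reference member."""
    if not query_set or not reference_set:
        raise ValueError("both sets must be non-empty")
    minima = [
        min(tanimoto_dissimilarity(query, reference) for reference in reference_set)
        for query in query_set
    ]
    return sum(minima) / len(minima)


set_a = [frozenset({1, 2, 3}), frozenset({4, 5})]
set_b = [frozenset({1, 2, 3}), frozenset({4, 5, 6})]
# minD is 0 for the first member and 1/3 for the second, so the mean is 1/6.
print(round(mean_min_dissimilarity(set_a, set_b), 3))  # 0.167
```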

Best regards,

Lex

ChemAxon efa1591b5a

31-08-2011 14:12:02

Hi Lex,

"Shelat, A. A.; Guy, R. K., Scaffold composition and biological relevance of screening libraries. Nat Chem Biol 2007, 3 (8), 442-446, looks most appealing, but resource demanding."

I was not aware of this paper, thank you for drawing my attention to it. However, I doubt that we can implement something similar in the near future.

"I just started to use compr. Generally it gives results comparable (at level CF -f 2048 -n 8 -b 4) with Discovery Studio (I had a chance to run several test sets on a friend's workstation). I really like compr because of the better control (for many routine tasks a GUI is just a waste of time) and the speed of the process, although I have no opportunity (or resources) to try Pipeline Pilot."

I'm glad to hear that you found 'compr' useful and that you like the command line interface. Most users are less familiar with this sort of batch processing application, which is generally more suitable for 'IT people'. You may also consider KNIME, an affordable alternative to PP, in which most ChemAxon tools have been integrated (made available as 'nodes').

Thank you again for the useful comments!

Miklos

User 68d678d290

31-08-2011 15:52:35

Hello Miklos,

Thank you very much for your comments.

Actually, compr in combination with JKlustor provides plenty of options, more than sufficient for my analysis - the ChemAxon team makes marvelous software.

However, I would like to draw your attention to two additional papers:

1) Hassan Rezaei, Masashi Emoto, and Masao Mukaidono - New Similarity Measure Between Two Fuzzy Sets - Journal of Advanced Computational Intelligence and Intelligent Informatics 2006, 10(6), 946-953.
2) Amos Tversky - Features of similarity - Psychological Review 1977, 84(4), 327-352.

I do hope that you will not consider me rude for this. I really like how compr works; however, there are some discrepancies in the final summary.

The first paper proposes an approach that might help to solve the symmetry issue for two sets of different dimensions.

The second, the seminal work on similarity, discusses problems of symmetry in similarity more broadly. Unfortunately, it is sometimes overlooked in the cheminformatics community that if two sets A(N), B(M) have N>M, then their similarity assessment S(A,B) != S(B,A), while normalization to calculate an average distance between their centers of weight is often quite meaningless.
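
The asymmetry discussed in the Tversky paper can be sketched with the ratio model, S(a,b) = |a & b| / (|a & b| + alpha*|a - b| + beta*|b - a|), which is symmetric only when alpha equals beta. The sketch below applies it to two feature sets (for example the "on" bits of two fingerprints), not to whole compound sets as in the discussion above; the weights and the example sets are chosen purely for illustration.

```python
from typing import FrozenSet


def tversky_similarity(
    a: FrozenSet[int], b: FrozenSet[int], alpha: float, beta: float
) -> float:
    """Tversky ratio model on two feature sets."""
    common = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    denominator = common + alpha * only_a + beta * only_b
    if denominator == 0:
        raise ValueError("similarity is undefined for these inputs")
    return common / denominator


large = frozenset({1, 2, 3, 4})
small = frozenset({1, 2})

# With alpha = 1 and beta = 0 only the features unique to the first argument count.
print(tversky_similarity(large, small, alpha=1.0, beta=0.0))  # 0.5
print(tversky_similarity(small, large, alpha=1.0, beta=0.0))  # 1.0

# With alpha = beta = 1 the measure reduces to the symmetric Tanimoto coefficient.
print(tversky_similarity(large, small, alpha=1.0, beta=1.0))  # 0.5
```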

As I pointed out earlier (see my previous posts), while compr works excellently for estimating the diversity of a library, I found some unexpected results when comparing two different sets, especially if they differ in their dimensions. Nevertheless, with option -z compr provides plenty of data for accurate analysis of different sets.

It seems to me that an asymmetrical calculation provides more useful information:

S(A,B) - an estimate of the degree of redundancy of set A relative to set B

S(B,A) - an estimate of the degree of congruency of set B relative to a continuity of subsets of set A

And I am very thankful and excited that compr enables me to do such calculations by changing the order of the input files.

It would also be really helpful to have an option to list minD for up to 3 nearest neighbours (nnbs).
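
What such a listing could look like is sketched below. This is not an existing compr option; the function name, the compound identifiers and the distances are invented for illustration. The sketch assumes the distances from one query compound to the members of the other set are already available.

```python
import heapq
from typing import Dict, List, Tuple


def nearest_neighbours(
    distances: Dict[str, float], count: int = 3
) -> List[Tuple[str, float]]:
    """Return up to `count` (identifier, distance) pairs, closest first."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return heapq.nsmallest(count, distances.items(), key=lambda item: item[1])


# Hypothetical distances from one query compound to four reference compounds.
query_distances = {"mol_7": 0.42, "mol_2": 0.10, "mol_9": 0.31, "mol_4": 0.55}
for identifier, distance in nearest_neighbours(query_distances):
    print(identifier, distance)
```

The first pair returned is the nearest neighbour, so its distance corresponds to minD for that query compound.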

Thanks a lot,

Lex

ChemAxon efa1591b5a

02-09-2011 12:42:00

Hi Lex,

Many thanks for the wise comments, the kind words and for the useful suggestions.

Asymmetric metrics have been introduced in some tools in the past few years, while some old applications like compr have not been updated yet. We will work on that in the future.

Best regards,

Miklos