Using GenerateMD to generate ECFP

User 13b12cda82

13-04-2013 13:42:29

Hello,


 


I am trying to use the API to generate ECFP fingerprints.

When I run the code below, a table 'SSARSTRUCTURES_MD_ECFP' is created with the columns (CD_ID, MD_DATA), but it is completely empty. I am using the latest JChem jars (5.12.3).

The examples in the docs are very light on generating descriptors from the database. The ecfp.xml file I used comes with the JChem download.

Here is my code:


 



    final ConnectionHandler connectionHandler = Connection.getConnectionHandler(1);

    try {
        final File ecfp = new File("ecfp.xml");
        assert ecfp.exists();

        final ECFPParameters ecfpConfig = new ECFPParameters(ecfp);
        GenerateMD generator = new GenerateMD();
        generator.setConnectionHandler(connectionHandler);

        generator.setStructureTableName("SSARSTRUCTURES");
        generator.setSelectStatement("SELECT CD_SMILES FROM SSARSTRUCTURES");
        generator.setDescriptor("ecfp", "CF", ecfpConfig, "");
        generator.init();
        generator.run();
        generator.close();
    } catch (Exception tt) {
        tt.printStackTrace();
    }


ChemAxon 1b9e90b2e7

15-04-2013 23:04:16

Hi Jacob,


I could reproduce your issue, in the sense that no ECFP descriptors were generated with the code snippet above. My test environment was 1000 structures imported with JChem 5.4.2 into a MySQL (5.5) database. I then generated ECFP descriptors from the example ecfp.xml with JChem 5.12.3.


I got the following exception:


Exception in thread "main" chemaxon.descriptors.MDGeneratorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'SELECT CD_SMILES FROM SSARSTRUCTURES' at line 1
    at chemaxon.descriptors.GenerateMD.init(GenerateMD.java:670)
    at test.ECFPDemo.main(ECFPDemo.java:41)


The fix:


The line generator.setSelectStatement("SELECT CD_SMILES FROM SSARSTRUCTURES"); should be removed, and the line

generator.setDescriptor("ecfp", "CF", ecfpConfig, "");

should be changed to

generator.setDescriptor("ecfp", "ECFP", ecfpConfig, "");


 


With these minor modifications, the code snippet generated the ECFP descriptor table correctly.


Please see this version attached.


Kind regards,


Adrian

ChemAxon 1b9e90b2e7

16-04-2013 13:40:48

Hi Jacob,


If you are interested in the Descriptors API itself, without the persistence functionality, please see the attached source code example. It generates descriptors from the parameter files and compares them using a selected metric (Tanimoto).
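For readers without the attachment, the Tanimoto comparison mentioned above can be illustrated with plain java.util.BitSet. This is only a minimal sketch of the metric on binary fingerprints, not the attached ChemAxon example and not the JChem API; the class name TanimotoSketch and the treatment of two empty fingerprints are assumptions made for illustration.

```java
import java.util.BitSet;

public class TanimotoSketch {

    /** Tanimoto coefficient of two binary fingerprints: |A AND B| / |A OR B|. */
    public static double tanimoto(BitSet a, BitSet b) {
        BitSet intersection = (BitSet) a.clone();
        intersection.and(b);
        BitSet union = (BitSet) a.clone();
        union.or(b);
        int unionCount = union.cardinality();
        // Assumption: two all-zero fingerprints are treated as identical.
        return unionCount == 0 ? 1.0 : (double) intersection.cardinality() / unionCount;
    }

    public static void main(String[] args) {
        BitSet a = new BitSet(8);
        a.set(0); a.set(2); a.set(5);
        BitSet b = new BitSet(8);
        b.set(2); b.set(5); b.set(7);
        // 2 bits in common, 4 distinct bits set overall
        System.out.println(tanimoto(a, b)); // prints 0.5
    }
}
```

A value of 1.0 means the two bit sets are identical, and 0.0 means they share no set bits.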


In the next release, 6.0, we will provide a type-safe, redesigned API for this functionality.


Hope this helps,


Adrian

User 13b12cda82

16-04-2013 17:00:13

Thank you.


 


When I generate the ECFPs in the database, I get a column that holds BLOB data. Is there a way to deserialize the BLOB into a string or, better, to store the data as a String instead of a BLOB?

ChemAxon 1b9e90b2e7

17-04-2013 13:18:38

Hi Jacob,


Yes, there is a way to deserialize the BLOB data into various data representations. You can convert the "BLOB" field into a string of zeros and ones. See the attached code example.


This representation requires a full byte (one character) for every 0 or 1, whereas the more concise native format stored in the database uses a single bit for each.
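The size difference can be seen in a small, generic sketch that expands packed bytes into a string of zeros and ones. It assumes the bytes are a plain packed bit array read most significant bit first; the actual layout of the MD_DATA BLOB is not described in this thread, so the attached ChemAxon example remains the reference for real data. The class name BitStringSketch and the sample bytes are hypothetical.

```java
public class BitStringSketch {

    /** Renders packed bytes as '0'/'1' characters, most significant bit of each byte first. */
    public static String toBitString(byte[] packed) {
        StringBuilder sb = new StringBuilder(packed.length * 8);
        for (byte value : packed) {
            for (int bit = 7; bit >= 0; bit--) {
                sb.append(((value >> bit) & 1) == 1 ? '1' : '0');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not a real ECFP fingerprint.
        byte[] packed = {(byte) 0xA0, 0x01};
        String bits = toBitString(packed);
        System.out.println(bits); // prints 1010000000000001
        System.out.println(packed.length + " bytes packed, " + bits.length() + " characters as text");
    }
}
```

Every packed byte becomes eight characters, so the text form is eight times larger when each character is stored as one byte.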


Hope this helps.


Regards,


Adrian

User 13b12cda82

17-04-2013 13:56:02

Thank you. This helps, but could I do the deserialization before I store it in the database?

ChemAxon 1b9e90b2e7

17-04-2013 14:26:03

jasiedu wrote:

Thank you. This helps, but could I do the deserialization before I store it in the database?



Currently it is not possible to choose the data representation in which the descriptor is persisted to the database.

What I would do is generate the descriptors in the standard format with the ChemAxon API and then, as an extra step, add an extra column or a relational table holding the same descriptor in the format you need. This extra step could be based on the previous code example.

Please note that ChemAxon's similarity calculations take our standard concise binary representation as input. If you need a similarity metric that is not covered by our products, please let me know.


Kind regards,


Adrian

User 13b12cda82

18-04-2013 00:48:51

OK, thank you. I will try it out tonight.