Chemical terms calculated when they are not needed?

User a11e9761d6

07-12-2013 23:50:32

Hi ChemAxon,


I'm using UpdateHandler to perform inserts with duplicate filtering enabled. I just added a new chemical term that uses acidicpKa('1') and basicpKa('1').


I was surprised to see that this slows down inserts significantly even when the structure already exists and thus the insert does not do anything. A test using 100 existing structures is now 5 seconds slower.


Am I right in presuming that the chemical term is being calculated even when the structure already exists (and thus the chemical term isn't needed)? Are there any workarounds for this?


Thanks,


Krishna


 


 

ChemAxon 42004978e8

09-12-2013 09:23:51

Hi Krishna,


Chemical terms are indeed calculated even if the structure cannot be inserted. You can avoid this unnecessary calculation if you perform a duplicate search with the given structure before inserting and insert only structures that are not found. In this case you may also switch off the duplicate filtering table option. 
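The check-then-insert flow described above could be sketched roughly as follows. This is only an illustration of the control flow: `StructureStore`, `isDuplicate` and `insert` are hypothetical names, not JChem API, and the in-memory store merely stands in for a real duplicate search and a real insert. Note that the check and the insert are two separate steps, so with several concurrent writers the same structure could still slip in twice.

```java
import java.util.HashSet;
import java.util.Set;

final class CheckThenInsert {

    // Hypothetical abstraction over the structure table; not part of JChem.
    interface StructureStore {
        boolean isDuplicate(String structure); // stands in for a duplicate search
        void insert(String structure);         // stands in for the expensive insert
    }

    // In-memory stand-in so the sketch runs without a database.
    static final class InMemoryStore implements StructureStore {
        private final Set<String> rows = new HashSet<>();
        int insertCalls = 0;

        @Override
        public boolean isDuplicate(String structure) {
            return rows.contains(structure);
        }

        @Override
        public void insert(String structure) {
            insertCalls++; // in the real case, chemical terms would be calculated here
            rows.add(structure);
        }
    }

    // Returns true if the structure was inserted, false if it was skipped as a duplicate.
    static boolean insertIfNew(StructureStore store, String structure) {
        if (store.isDuplicate(structure)) {
            return false; // skipped: no insert, hence no chemical term calculation
        }
        store.insert(structure);
        return true;
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        System.out.println(insertIfNew(store, "CCO")); // first time: inserted
        System.out.println(insertIfNew(store, "CCO")); // second time: skipped
        System.out.println("insert calls: " + store.insertCalls);
    }
}
```

The point of the pattern is that the costly step is only reached for structures that pass the cheap check first.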


I hope this solves your problem.


Regards,


Robert

ChemAxon d4fff15f08

09-12-2013 15:24:00

Hi Krishna,


 


Your assumption is correct. With duplicate filtering enabled, we prepare all the structure data needed for insertion into the given DB structure; the hash-based duplicate filtering is executed only once all this information is ready. The aim was to pave the way for full record-level duplicate filtering, which would compare not only the structure-related fields (calculated from the molecule) but also those that cannot be calculated or predicted from the structure (i.e. any custom field, e.g. inventory amount, internal_ID, and so on).


Unfortunately there is no workaround for this case at the moment, but we will change the workflow to optimize the filtering for cases where no structure-independent column is present in the DB.


 


Irrespective of the above, a 5 s increase for 100 structures seems far too much to me. Could you give me some information about your test? I would like to reproduce it with your settings (I tried it with the default options and could not get a slowdown comparable to yours):


- Which version of JChem do you use?
- Which RDBMS do you have?
- What is your table type?
- Is there anything special about the molecules being inserted (Markush properties or something similar)?
- What is the exact expression of the CT field you defined?


 


Thanks in advance for the info.


Best regards,


Norbert

User a11e9761d6

09-12-2013 19:19:19

Hi Robert and Norbert,


Thanks for the helpful replies. Does duplicate search rely on the structure cache? Currently inserting with duplicate filtering avoids the need to load the structure cache into memory.


Regarding the setup of my test:


- JChem 6.1.3, run within Ruby using RJB


- MySQL 5.6


- Table type is 'any' (we have considered switching to 'molecules', but we're not sure we want to drop support for Markush structures)


- Chemical term is below, but it seemed like simply calling "acidicpKa('1') + basicpKa('1')" took the same amount of time:


    offset = 1000;
    neutral_ph = 7.4 + offset;
    strongest_acidic_pka = acidicpKa('1') + offset;
    strongest_basic_pka = basicpKa('1') + offset;
    safe_acidic_pka = min(strongest_acidic_pka, offset * 2);
    safe_basic_pka  = max(strongest_basic_pka, -offset * 2);
    is_acidic = (neutral_ph - safe_acidic_pka) > (safe_basic_pka - neutral_ph);
    max(strongest_acidic_pka * is_acidic, strongest_basic_pka * NOT is_acidic) - offset


- The structures are ordinary and drug-like, though on the large side of drug likeness.
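For readers trying to follow the chemical term quoted above: when both pKa values are ordinary finite numbers (well within ±1000), the offset cancels out and the clamps have no effect, so the expression reduces to picking whichever of the two pKa values lies further on the ionized side of pH 7.4. The Java sketch below mirrors only that case. The class and method names are made up for illustration, and how Chemical Terms treats a missing pKa (presumably the reason for the offset and the min/max clamps) is not stated in this thread and is deliberately not modelled.

```java
final class DominantPka {
    static final double NEUTRAL_PH = 7.4;

    // Mirrors the comparison in the chemical term for finite inputs only;
    // missing pKa values are NOT handled here (their behaviour is an open assumption).
    static double dominantPka(double strongestAcidicPka, double strongestBasicPka) {
        boolean isAcidic =
                (NEUTRAL_PH - strongestAcidicPka) > (strongestBasicPka - NEUTRAL_PH);
        return isAcidic ? strongestAcidicPka : strongestBasicPka;
    }

    public static void main(String[] args) {
        System.out.println(dominantPka(4.0, 9.0));  // acidic pKa is selected
        System.out.println(dominantPka(10.0, 9.5)); // basic pKa is selected
    }
}
```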

ChemAxon d4fff15f08

10-12-2013 13:59:31

Hi Krishna,


 


No, the duplicate search does not need the cache.


 


Thanks for sending the details; we will try to get some information on the calculation speed.


 


Best regards,


Norbert

ChemAxon d4fff15f08

02-01-2014 16:56:11

Hi Krishna,


 


I tried to reproduce the behaviour you described but did not succeed; please see the details below.


I used PubChem's Compound_000000001_000025000.sdf file, which contains 23071 structures, including 2 pairs of duplicates.


My steps were:


1. Importing the PubChem structures without any CT into a newly created, empty table of type 'any' (duplicate filter ON)


Total number of processed molecules: 23071


Not imported (duplicates): 2


Not imported (error): 0


Successfully imported: 23069


Elapsed time: 52 seconds


 


2. Re-importing the PubChem structures into the same table (duplicate filter ON)


Total number of processed molecules: 23071


Not imported (duplicates): 23071


Not imported (error): 0


Successfully imported: 0


Elapsed time: 52 seconds


 


3. Importing the PubChem structures with a CT column ("acidicpKa('1') + basicpKa('1')") into a newly created, empty table of type 'any' (duplicate filter ON)


Total number of processed molecules: 23071


Not imported (duplicates): 2


Not imported (error): 0


Successfully imported: 23069


Elapsed time: 136 seconds


 


4. Re-importing the PubChem structures into the same table (duplicate filter ON, with the CT column "acidicpKa('1') + basicpKa('1')")


Total number of processed molecules: 23071


Not imported (duplicates): 23071


Not imported (error): 0


Successfully imported: 0


Elapsed time: 141 seconds


 


In conclusion, a CT that needs to be calculated significantly increases the import time (step 1 vs. step 3), but duplicate filtering does not take more time even when all the input structures are discarded as duplicates (step 1 vs. step 2, step 3 vs. step 4).
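To put the four timings into per-structure terms, here is a small worked calculation in Java. It uses only the figures quoted in this thread (52 s, 136 s and 141 s for 23071 structures, and the originally reported 5 s for 100 structures); the class and method names are just for illustration. These are wall-clock numbers, so they also reflect whatever threading the import tool used.

```java
final class ImportTimings {
    static final int STRUCTURES = 23071;

    // Extra wall-clock time per structure (in ms) attributable to the CT column.
    static double ctOverheadMsPerStructure(double secondsWithCt, double secondsWithoutCt) {
        return (secondsWithCt - secondsWithoutCt) * 1000.0 / STRUCTURES;
    }

    public static void main(String[] args) {
        double firstImport = ctOverheadMsPerStructure(136, 52); // step 3 vs. step 1
        double reImport = ctOverheadMsPerStructure(141, 52);    // step 4 vs. step 2
        double reported = 5.0 * 1000.0 / 100;                   // 5 s for 100 structures

        System.out.printf("first import: %.2f ms per structure%n", firstImport);
        System.out.printf("re-import:    %.2f ms per structure%n", reImport);
        System.out.printf("reported:     %.2f ms per structure%n", reported);
        System.out.printf("reported / re-import: %.1f%n", reported / reImport);
    }
}
```

The CT overhead works out to a little under 4 ms per structure in both runs, against 50 ms per structure in the original report, which is a gap of roughly 13-fold.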


 


I was curious about the time needed to calculate all the values that need to be inserted into the DB, so I initiated a recalculation on the table containing the CT (23069 molecules). This procedure reads the cd_structure value (the original structure, stored in the DB in its original format), recalculates the chemical terms and overwrites the values in the DB.


The running time of the recalculation procedure was:

- 123.981 s for the table whose CT expression is "acidicpKa('1') + basicpKa('1')"
- 126.718 s for the table whose CT is your longer script


 


Although these measurements are not fully comparable and their times are not additive (in the first case there is file I/O, while in the second the structure is read from the DB; on import, other derived data such as molweight and fingerprints are calculated too), both the values and their ratio seem realistic.


 


For the moment I don't see how this procedure could be so slow in your environment.


Any ideas or suggestions would be welcome.


 


Best regards,


Norbert

User a11e9761d6

02-01-2014 22:00:52

Hi Norbert,


Thank you for taking the time to look into this. Would you mind sharing the code that you used in your benchmarks? I am curious if the difference is due to how we are using the API when creating structures.


Krishna

ChemAxon d4fff15f08

03-01-2014 11:03:21

Hi Krishna,


 


For all the measurements I used JChem Manager (jcman) from the command line. I did it this way because it seemed the easiest way of checking the performance (the API should not differ from this).


 


Could you try the same in your environment too? For the import procedures I used my laptop (Win7 64-bit) with a MySQL 5.6 connection through a 100 Mbit LAN. If you are not familiar with JChem Manager, you can find a guide here: https://www.chemaxon.com/jchem/doc/admin/


Should you have any questions, I would be happy to help you in getting the answers.


 


All the best,


Norbert

User a11e9761d6

03-01-2014 20:19:42

Hi Norbert,


Thanks for that clarification. The difference we are seeing might be due to the fact that we are creating an update handler, then using it to insert each structure individually. The pseudo-code is below:



    # Create one update handler and reuse it for every structure.
    update_handler = UpdateHandler.new(connection_handler, UpdateHandler.INSERT, table_name, "")

    # for each structure:
        update_handler.setStructure(structure)
        update_handler.setIgnoreChemicalTermsExceptions(true)
        update_handler.setDuplicateFiltering(true)
        response = update_handler.execute(true)



Do you think this explains why you are seeing less of a slowdown? Given that we want to process each structure individually (performing other operations based on the result), do you have suggestions for how we could speed it up?


Thanks,

Krishna

ChemAxon d4fff15f08

07-01-2014 13:03:00

Hi Krishna,


 


Your code seems to be OK; during the command-line import we do practically the same. Please give me some time, and I will get back to you with the results of trying out the API.


 


Thanks, 


Norbert

ChemAxon d4fff15f08

08-01-2014 10:53:13

Hi Krishna,


 


When we tried to reproduce the behaviour using your code, the cause became clear. The difference between importing structures with JChem Manager and with the code you posted lies in the multi-threading capability of JChem Manager. Although JChem Manager computes all structure records before checking the molecule for duplicates (we will change this in the next major release), it does so on as many threads as are allocated to the program. In parallel with this, the DB inserts are triggered, and these are indeed processed serially (one after the other). With this setup we measured a 5-6 fold performance difference between using UpdateHandler in one thread and JChem Manager (multi-threaded) for inserting structures into the DB. Consequently, we recommend adapting your code to run the calculations in parallel.
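The general shape of "calculate in parallel, insert serially" could look something like the Java sketch below. It is a pattern illustration only: `prepare` and `insert` are hypothetical callbacks standing in for the expensive per-structure calculation and the database write, and nothing in this thread says whether UpdateHandler allows those two steps to be separated or shared across threads, so treat that part as an open assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.Function;

final class ParallelPrepareSerialInsert {

    // Runs `prepare` on a thread pool and feeds the results, in input order,
    // to `insert` on the calling thread. Returns the number of inserted records.
    static <I, R> int run(List<I> inputs, Function<I, R> prepare,
                          Consumer<R> insert, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> pending = new ArrayList<>();
            for (I input : inputs) {
                Callable<R> task = () -> prepare.apply(input);
                pending.add(pool.submit(task));
            }
            int inserted = 0;
            for (Future<R> result : pending) {
                insert.accept(result.get()); // blocks until this record is ready
                inserted++;
            }
            return inserted;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a record", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("preparing a record failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> structures = List.of("CCO", "c1ccccc1", "CC(=O)O");
        List<String> table = new ArrayList<>();
        // This lambda stands in for an expensive calculation such as a pKa term.
        Function<String, String> prepare = s -> s + ":" + s.length();
        Consumer<String> insert = table::add;
        int n = run(structures, prepare, insert, 4);
        System.out.println(n + " records inserted: " + table);
    }
}
```

Because the results are consumed one at a time on the calling thread, any per-record follow-up work can still happen right after each insert, while the slow calculations for later records are already running in the background.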


 


I hope this information helps you get past the performance issues you experienced.


Best regards,


Norbert

User a11e9761d6

09-01-2014 20:00:00

Thank you Norbert, that helps.


How exactly do you recommend parallelizing the inserts in this situation? Multiple connection handlers? Multiple update handlers using one connection handler? I haven't looked at the documentation, but I presume we wouldn't want to use the same update handler in multiple threads.


Cheers,


Krishna

ChemAxon 25dcd765a3

10-01-2014 10:55:06

Hi Krishna,


 


We discussed this problem a bit yesterday with Norbert.


From your code it seems that you want to import structures rather than update them. (Is that the case?)


If so, have you tried Importer?


Here is a simple use case for that. Please check the public static void databaseImport(..) method.


 


best 


Volfi


 

User a11e9761d6

10-01-2014 19:57:35

Hi Volfi,


Thanks, it is good to know that Importer is the recommended way of doing this. We had actually planned on trying something like that, but it would require significant changes to our application. Right now our application processes all the other information in the file, and only the structures themselves are handled by JChem.


To use Importer, it seems we would need to process the file twice: once with Importer, then again with our application, somehow correlating the Importer results with the appropriate line/record. This sounds tricky, but maybe I'm missing an easier way?


Krishna

ChemAxon 25dcd765a3

13-01-2014 14:07:19

Hi Krishna,


Your workflow is not clear to me, but if your application needs data from the file for further processing, then as far as I can see you cannot avoid reading it again. However, if you want to store a particular field of an SDF file in your table, you can use the setFieldConnections method to define the connection between the SDF field and the database column. Does this help? So my suggestion would be to handle the whole import process with Importer; again, that is easy to say, as I don't see the whole picture :-) .


best


Volfi

User a11e9761d6

15-01-2014 19:27:45

Hi Volfi,


Thanks for the reply. In our case there are usually many other fields, all of which need to be processed and stored in a variety of other tables, so I don't currently see a way to use the Importer without reading the file more than once. 


Krishna