Hi Krishna,
I tried to reproduce the behaviour you described, but could not. Please see the details below.
I used PubChem's Compound_000000001_000025000.sdf file, which contains 23071 structures, of which 2 pairs are duplicates.
My steps were:
1. Importing the PubChem structures without any Chemical Terms (CT) column into a newly created, empty anytable (duplicate filter ON)
Total number of processed molecules: 23071
Not imported (duplicates): 2
Not imported (error): 0
Successfully imported: 23069
Elapsed time: 52 seconds
2. Re-importing the PubChem structures into the same table (duplicate filter ON)
Total number of processed molecules: 23071
Not imported (duplicates): 23071
Not imported (error): 0
Successfully imported: 0
Elapsed time: 52 seconds
3. Importing the PubChem structures with a CT column ("acidicpKa('1') + basicpKa('1')") into a newly created, empty anytable (duplicate filter ON)
Total number of processed molecules: 23071
Not imported (duplicates): 2
Not imported (error): 0
Successfully imported: 23069
Elapsed time: 136 seconds
4. Re-importing the PubChem structures into the same table (duplicate filter ON, with the CT column "acidicpKa('1') + basicpKa('1')")
Total number of processed molecules: 23071
Not imported (duplicates): 23071
Not imported (error): 0
Successfully imported: 0
Elapsed time: 141 seconds
In conclusion, the import time increases significantly when a CT has to be calculated (compare steps 1 and 3), but the duplicate filtering does not add noticeable time even when all the input structures have to be discarded as duplicates (compare step 1 with step 2, and step 3 with step 4).
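To make the comparison of the four runs easier to check, here is a small Python sketch of the arithmetic. It only uses the elapsed times and the molecule count reported above; the variable and function names are my own and are not part of any JChem API.

```python
# Figures copied from the four import runs reported above.
TOTAL_STRUCTURES = 23071

# Elapsed time in seconds, keyed by (CT column present, re-import).
elapsed = {
    (False, False): 52,   # step 1: import, no CT
    (False, True): 52,    # step 2: re-import, no CT
    (True, False): 136,   # step 3: import, with CT
    (True, True): 141,    # step 4: re-import, with CT
}


def ms_per_structure(seconds: float, count: int = TOTAL_STRUCTURES) -> float:
    """Average wall-clock time per processed structure, in milliseconds."""
    return 1000.0 * seconds / count


# Extra time when the CT has to be calculated on a fresh import.
ct_overhead = elapsed[(True, False)] - elapsed[(False, False)]

# Extra time when every input structure is rejected as a duplicate.
dup_overhead_no_ct = elapsed[(False, True)] - elapsed[(False, False)]
dup_overhead_with_ct = elapsed[(True, True)] - elapsed[(True, False)]

print(f"CT overhead on import: {ct_overhead} s")
print(f"Duplicate-only run, no CT: {dup_overhead_no_ct:+d} s")
print(f"Duplicate-only run, with CT: {dup_overhead_with_ct:+d} s")
for key, seconds in elapsed.items():
    print(key, f"{ms_per_structure(seconds):.2f} ms per structure")
```

With these numbers the CT accounts for 84 extra seconds, while the all-duplicates runs differ from the fresh imports by 0 and 5 seconds.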
I was curious about the time needed to calculate all the values that need to be inserted into the DB, so I started a recalculation on the table containing the CT (23069 molecules). This procedure reads the cd_structure value (the original structure, in its original format, as stored in the DB), recalculates the chemical terms and overwrites the values in the DB.
The running times of the recalculation procedure were:
- 123.981 s for the table having a CT with the expression "acidicpKa('1') + basicpKa('1')"
- 126.718 s for the table having a CT with your longer script
These measurements are not fully comparable with the import runs, and the times are not additive: the import reads the structures from a file while the recalculation reads them from the DB, and the import also calculates other derived data (e.g. molweight, fingerprints). Even so, the results and their ratio seem realistic.
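For reference, the two recalculation times can be compared with a short Python sketch. It is plain arithmetic on the figures I measured; the names are only illustrative.

```python
# Recalculation times in seconds for the 23069 molecules in the table.
MOLECULES = 23069
recalc_seconds = {
    "simple_expression": 123.981,  # acidicpKa('1') + basicpKa('1')
    "longer_script": 126.718,      # the longer CT script
}


def ms_per_molecule(seconds: float, count: int = MOLECULES) -> float:
    """Average recalculation time per molecule, in milliseconds."""
    return 1000.0 * seconds / count


# How much slower the longer script is relative to the simple expression.
ratio = recalc_seconds["longer_script"] / recalc_seconds["simple_expression"]

for name, seconds in recalc_seconds.items():
    print(f"{name}: {ms_per_molecule(seconds):.2f} ms per molecule")
print(f"longer / simple = {ratio:.3f}")
```

On my measurements the longer script is only about 2% slower than the simple expression, so the script length alone does not seem to explain a large slowdown.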
At the moment I don't see why this procedure would run so slowly in your environment.
If you have any ideas or suggestions, they would be welcome.
Best regards,
Norbert