Merge of large data sets

User 8201cee929

03-04-2007 09:29:11

Hi,


I created a local db with ~200.000 records; upload had a reasonable speed:4.000 compounds/min.


I should add more fields from an sd file but the data merge would take forever...


The merge speed is around 4 compounds/min for which I wouldn't


even use the word 'slow'.


Is there any way to speed up merge process?


(IJC 1.0)


Thank you,


Peter

ChemAxon fa971619eb

03-04-2007 14:27:35

I think this is because when a merge is being a done a query has to be done to identify the correct row, and for a large table this is slow, largely because there is no index of the column being queried. Performance would be much faster if an index was present.


IJC does not currently provide a direct way of adding an index, but we are soon going to be adding this to 2.0. But in the absence of this we can probably add an index directly to the database which should solve the problem.





Let me investigate this a bit more and I'll report back with a suggested approach.





Tim

ChemAxon fa971619eb

04-04-2007 08:35:03

I have investigated this further, and confirm exactly what you find, and that the problem is the lack of the index on the database column which you are using to merge. If you add the index the process is ~1000 times as fast (I didn't measure it as the process is so slow without the index you would need sun dial to measure it!).





The improvement is so dramatic that it makes us even more sure that we will add the abilty add indexes in IJC2.0 very soon!





The current process you need to follow is a bit lengthy, but hopefully it will be worthwhile. It uses the traditional "developer backdoor", but I think you'll find it worthwhile as your merge should now complete in a few seconds. These instructions assume you are using the embedded Derby database.





1. Open IJC. Right click the node for your database table in the Projects Window and choose 'Edit Table Definition'. Database Table Editor will open.





2. Make a note of the 'Table name' of your DB table


Then select the Field you are using for your merge with and make of note of the


'DB Column name'. You will need both of these later.





3. Make a note of the directory in which your project is stored. This is probably in you home directory ("C:\Documents and Settings\<username>\ijc-project)





4. Close all tabs in the main window - nothing must remain open.





5. Restart IJC. Only the Welcome Screen should appear.





6. From the Windows Menu choose: Runtime (should be at the bottom).





7. The Runtime Window will open. In it open the 'Databases' node, and then the Drivers node. Find the 'Derby' (not Derby (Client)') node and right click on it and chose 'Connect using...'





8. In the dialog that appears enter:


As the 'Database URL' you need to enter the location of the Derby database. To do this you need the information you noted in step 3. You will need a value such as:


jdbc:derby:C:/Documents and Settings/<username>/ijc-project/.config/localdb/db





(substitue the path to your database as appropriate).


The username and password should be empty.





9. Then click 'Next' and on the next step 'APP' should be selected as the as the schema name. Click 'Finish'.





10. The database connection node will appear as a new child node of the Databases node. Expand it and find the table you noted in step 2 and expand it.





11. Open the Indexes node for that table and right click on it and choose 'Add Index...'





12. Give the new index a sensible name and select only the column that you noted in step 2. Click 'OK'. There will be a short delay while the index is added. It will appear in you list of indexes.





13. Right click table node and choose ' Disconnect'





14. Restart IJC.





15. Try your merge again. it will be much faster.

User 8201cee929

04-04-2007 09:36:16

Hi,


It works fast with the index! That's what I needed.


Thank you!


peter

ChemAxon fa971619eb

04-04-2007 09:38:44

Yes, much faster.


This will be much easier once we add the index management to 2.0.





Tim