Slow Exact/Duplicate searching - ChemAxon Forum Archive

User 773d472e7f

26-06-2013 15:55:21

Please advise if I have implemented this registeration procedure corectly:

Server: Windows Server 2012 64 bit | Oracle 11.2 | JChem Cartridge 5.12

Domain Index built on a regular structure table against a CLOB column containg molfile data.

PLSQL code controlling the addition of molecules into a regular structure table : AKOS_MOLTABLE. In summary:

1. A candidate molecule is DUPLICATE searched against the AKOS_MOLTABLE:

select AKOS_Number, Published into pAKOS_NUMBER, pPublished from AKOS_MOLTABLE where jc_compare(CTAB_CLOB, pCTAB, 't:d maxHitCount:1') = 1;

2. If it does not exist then the molecule data is inserted into the AKOS_MOLTABLE:

insert into AKOS_Moltable(CTAB_CLOB) values (pCTAB) returning AKOS_NUMBER into pAKOS_NUMBER;

As per the cartridge documentation I am using regular sql insert to do this because this is a regular table.

I have attached the complete code of these two functions in the attached text file.

how is the DOMAIN index maintained during the update process?

Must I call an update domain index procedure after every insert?

Is the domain index automatically updated?

Thanks for your help:

Bernard D'Alwis

ChemAxon aa7c50abf8

27-06-2013 09:16:22

Hi Bernard,

The index is updated automatically, no need to execute any special procudure -- with the following restriction:

With the "normal" way of operation, you have to call commit after any updates to include the update in the next search.

For duplicate search, the requireCommit search option is provided (http://www.chemaxon.com/jchem/doc/dev/cartridge/cartapi.html#jc_compare_requireCommit) to avoid the need for constant commit.The search with this option is typically only marginally slower than without and the performance gain from the absence of commits copiously offsets it.

Peter

User 773d472e7f

27-06-2013 10:00:19

Hi Peter,

Thanks for this information, which confirms my assumptions. As you can see from the previously attached plsql code the insert molecule function includes a commit for each molecule inserted.

We are building our experience with using the Jchem cartridge to manage molecule registrations into the AKos Samples Catalogue.

I use temporary 'load' tables for the import data that are subsiquently processed by the plsql procedures.

With our equipment we process aprox 10,000-20,000 structures per hour. (this includes selection/insertions/updates into a related Batch table). This processing rate is currently acceptable for our business purposes.

After processing several tens of thouands of molecules this process has slowed down 10 fold. (1000 - 2000 structures per hour) This slow rate no longer meets our business requirement.

As an initial step I am rebuilding the domain index and will then continue the registration process.

I would be gratefull if you are able to advise me on other steps I should take to problem solve this issue.

Many thanks

Bernard

ChemAxon aa7c50abf8

27-06-2013 13:29:32

Hi Bernard,

Looking at the JCC benchmarks (http://www.chemaxon.com/jchem/doc/admin/Performance.html#jcart_insert), it is apparent that the time taken to insert into a JChem index doesn't linearily increase with the number of structures inserted. The insert time is not even roughly close to constant... (The benchmark is slightly different from your use case, as executes only inserts into duplicate filtered index [created with duplicateFiltering=y] without distinct searches.) So far, we haven't payed attention to this, so I am not sure at this point what is behind this behaviour...

I understand that you also do batch inserts. What may be worhtwhile trying is to use a fresh connection after each 5000 structure inserted. The fact that the simple inserts (without search) scale linearily reminds me of a similar problem (inserts combined with molconversion -- still under investigation) which has the same work-around.

I also thought of suggesting to reduce drastically the frequency of the commits. But, while frequent commits are known to put a noticable burden on performance, I would expect them to require constant time to complete, so I am somewhat sceptical whether economizing on commits would help in this case. May be worthwhile to try if you have a chance...

Peter

User 773d472e7f

01-07-2013 16:43:57

Hi Peter,

I decided to rebuild the domain index.

After this was completed I restarted the (unmodified) registration process.

The registration speed has returned to the original acceptable level.

After several tens of thousands of regestrations the process seems to be continuing perform at this level.

Does this give you any clue as to the underlying cause?

Regards

Bernard

ChemAxon aa7c50abf8

02-07-2013 09:43:19

Hi Bernard,

I speculate that the speed improvement can be attributed to the fact that -- after recreating the index -- you re-started importing in a fresh database session. Did you try the work-around I suggested in my previous post which consists of doing the import in batches of just a couple thousand structures and doing each batch in a newly created connection?

Peter

User 773d472e7f

02-07-2013 09:53:24

Hi Peter,

Before rebuilding the domain index I did restart the oracle sessions between registration runs with no apparent improvement in the speed.

After the rebuild there has so far been no speed issue.

If the speed issue occures again I will again try the renew session approach as a first step to try to understand the issue further.

So far with several 10s of 1000s this has not been required.

Thanks for the help.

Bernard.

User 773d472e7f

11-02-2014 17:29:50

I am not sure if Bernard has included in his code that a new connection is made every 5000 structures. I will ask him.

This last registration required ca. 10 million compounds. We do it in batches of 1.8 million. The speed was about 1092 stuctures per minute. After 17 million the registration slowed down to 267 strucutres per minute. I rebuild the index, restarted the computer, but the speed is still only 200 structures per minute.

After I rebuild the index. I wanted to try searches in Instant JChem. I could not build the cache, it told me there was not enough memory. The system used 8.2 GB and the computer has 16 GB of RAM. I creased the heap size to 5GB, same problem, I increased it to 7 GB, finally it built the cach. This took overnight. A full structure search takes about 20 seconds, a substructure search takes about 45 seconds.

I tried a SSS during the registration, and this took longer than 15 minutes, and I stopped the query. This means I need a second installtion if I want to do searches.

What worries me is that it takes overnight if one wants to do the first search.

Any ideas, what could be wrong, or is this the best performance and we reach a size limit of the database. Can it be an Oracle problem. We are using 12 30GB files for the table space on three disks.

Luckily, I have a local db with 15 million strucutres using derby on another computer.
The SSS takes 7 seconds, a fullSS takes about 3 seconds. I can compare
this with an Oracle database of 16 milllion structures. The SSS and FullSS
wereabout 30-40 seconds. Why is the Oracle version so much
slower? Is there something wrong with our Oracle db?

Alex

ChemAxon d4fff15f08

12-02-2014 15:32:36

Hi Alex,

This behaviour seems very strange to me too.

Did you try to query the ORA DB table with standard SQL queries (e.g. select count(*) from [tablename] )? How fast is it (relatively)?

Regarding the memory usage we have some measured data here: https://www.chemaxon.com/jchem/doc/admin/Performance.html#hwrequirements

Building up the cache shouldn't take that much time neither. Could you please doublecheck the access to your table.

We will try to reproduce your problems, but it surely would take some time, so I am asking for your patience.

I will keep you posted with updates.

Best regards,

Norbert

User 773d472e7f

13-02-2014 09:50:49

Let's separate the issues in Oracle and Instant JChem.

Instant Jchem: I talked yeaterday to Mihaly in Zurich,and he thinks that IJC is not using the cartridge, since the heap size needs to be so big. I don't know off-hand how to connect Instant Jchem via the cartridge to Oracle.

At the moment I am doing a huge job, so Idon't want to disturb this. I will do this next week.

Oracle: When it could process about 1000 structures per minute, there were only 10% inserts (new strucutures). Now, with about 200 per minute, I am doing only inserts (all new structures). I would need to know every method how to optimize Oracle and the index. I don't have this knowledge.

Search time using the following script takes

call jchem_core_pkg.use_password('akos_samples', 'akos');
select to_char(sysdate,'YYYY MM DD HH24:MI:SS') "Started" from dual;
select akos_number from akos_moltable where jc_compare(CTAB_CLOB, 'InChI=1S/C14H15NO3/c1-3-18-14(17)8-15-10(2)12(9-16)11-6-4-5-7-13(11)15/h4-7,9H,3,8H2,1-2H3', 't:d') = 1;
select to_char(sysdate,'YYYY MM DD HH24:MI:SS') "Ended" from dual;

Started
-------------------
2014 02 13 10:37:04

AKOS_NUMBER
---------------
AKOS000100205

Ended
-------------------
2014 02 13 10:37:05

However, I don't think this is a real measure for a " find current search" if one would start with drawing a structure. How would I test this?

call jchem_core_pkg.use_password('akos_samples','akos');
select to_char(sysdate,'YYYY MM DD HH24:MI:SS') "Started" from dual;
select akos_number from akos_moltable where jc_compare(CTAB_CLOB, 'InChI=1S/C14H15NO3/c1-3-18-14(17)8-15-10(2)12(9-16)11-6-4-5-7-13(11)15/h4-7,9H,3,8H2,1-2H3', 't:s') = 1;
select to_char(sysdate,'YYYY MM DD HH24:MI:SS') "Ended" from dual;

Started
-------------------
2014 02 13 10:42:14

AKOS_NUMBER
---------------
AKOS000100205
AKOS000479344
AKOS000604930
AKOS001854819
AKOS001855140
AKOS003947137

6 Zeilen ausgewõhlt.

Ended
-------------------
2014 02 13 10:49:48

Ok...This seems to have a prolem. I am doing the insert job in the background.

Thanks for your help!

Alex

User 773d472e7f

18-02-2014 15:16:52

After inserting 21 million strucutres using the ORACLE cartridge I tried searches in IJC.

It took several hours to build the cache, but not any longer the whole night.

A SSS took 45-50 seconds until the first hit was displayed. I did abou 10 searches with different structures of different complexity. The time was always in this same range.

Registration:

When I insert structures, what happens with the index? Is the index refreshed each time after a commit? This would explain why the insert is so slow. Slow means 3 structures per second.

Can one insert 1 million structures and postpone the optimizing of the index? How would this influence the "find current" search? After all I have to check for duplicates.

Alex

ChemAxon d4fff15f08

19-02-2014 13:07:27

Hi Alex,

Thank you for the information. I am happy to hear that you managed to create a table that works as you are used to it. However, I am still interested in your former table, it would be nice to figure out what happened. Do you have any idea (was there any difference in the creation of the present table compared to the faulty one)? Can you try to query the "slow" table with standard (not cartridge) SQL commands e.g.: select count(*) from [tablename], or any other simple SQL command. I would be interested in the speed of these queries.

Concerning the indices: you are correct! At every insert the appropriate index is created for the newcomer. However, our tests show this process to be fast and taking just a negligible portion from the insert time. There is no function for postponing the index update on import. The only thing you can do is to drop the index, execute the import and rebuild the index for the whole table. This will probably not save you any time.

Best regards,

Norbert

User 773d472e7f

26-02-2014 23:04:42

You write: Can you try to query the "slow" table with
standard (not cartridge) SQL commands e.g.: select count(*) from
[tablename], or any other simple SQL command. I would be interested in
the speed of these queries. I will do this when the system is idle. At the moment, we run a time consuming SQL procedure.

When Oracle was idle: Using IJC a SSS takes in averyage 65-80 seconds, a "Find Current" takes about 45 seconds. The size of the database is over 22 million.

Using IJC I tried to export 1 million structures into an SDFile. The process ran out of memory when it reached 16 GB.

We implemented the SSS on the Web. The database is on Linux using MySQL, and the size is 16 million. A SSS takes often more than 30 seconds. The idex was freshly created. We are moving the system to a Windows server where the search system (iScienceSearch) and the cartridge will be on the same server. Let's see if we can speed up the SSS searches.

Thank you for explaining that the index is updated after each inser. It is hard to imagine that this is not a time consuming step having a very large database. Do you have any tesst results? I will try your suggestion when I have to do larger batches again.

Thanks for your help,

Alex

ChemAxon d4fff15f08

27-02-2014 16:00:37

Hi Alex,

Thank you for your cooperation!

Regarding your points:

1. Ok, thank you, I am still interested.

2. These search times seem a little high to me, but I am not an IJC expert (some of them will also check this post and will a react on this). You can get a rough guess about the typical search times checking out this site: https://www.chemaxon.com/jchem/doc/admin/Performance.html

As you can see the running time scales with the number of hits returned (the more hits need to be retrieved the longer the search time will be)

3. IJC guys will react on this.

4. I am a little confused regarding your Web implementation; what kind of DB access do you use? Our cartridge supports only Oracle. If you are using MySQL than I suppose you are not using cartridge, and as a consequence there is no need for indexes (which would be used by cartridge).

5. The indices are created on every insert but just for the new records, not recreating it for the whole table (they have been already calculated), so it does not scale up with the size of the table.

Best regards,

Norbert

ChemAxon 2bdd02d1e5

27-02-2014 16:01:30

Hi Alex, I'll answer IJC related questions:

When Oracle was idle: Using IJC a SSS takes in averyage 65-80 seconds, a "Find Current" takes about 45 seconds. The size of the database is over 22 million.

Unfortunately we don't have any numbers how fast it is on such large database. I'm afraid that this is realistic. Are these numbers with JChem Cartridge on Oracle?

Using IJC I tried to export 1 million structures into an SDFile. The process ran out of memory when it reached 16 GB.

What structures are they? Seems really too memory consuming, unless they are Markush structures.

Filip