Memory Issues during rebuild

User c5c63b5c6a

01-02-2010 14:31:55

Upgrading from 526 to 530 on Oracle 10.2.0.4


During a rebuild of an index we got a number of memory errors on the database, instead of stopping the rebuild it went on to the next row and then got another error. Eventually it raised so many of these it resulted in crashing the database. This error should be caught and thrown to the calling procedure rather than being swallowed in the log. The crash could then have been avoided.


See jcart1.log for the trace of the error.

ChemAxon aa7c50abf8

01-02-2010 14:40:34

I guess either you forgot to attach the log file or the upload was silently rejected as *.log appears to be an extension not allowed for uploads here (http://www.chemaxon.com/forum/attach_rules.php?f=7). Maybe in a gzip-ped form...


Thanks


Peter

User c5c63b5c6a

01-02-2010 14:42:54

Retry upload for attachment

ChemAxon aa7c50abf8

03-02-2010 00:04:26

Rachel,

Thank you for the log file!

My understanding of your initial post is that the issue is not the memory errors themselves, but the way they are dealt with. (In this sense, the title of the topic is a misnomer.) Please, correct me, if I misunderstand.

After checking the code and performing multiple tests with simulated errors, I am strongly inclined to say that our INDEX REBUILD implementation does catch the kind of errors you were having and re-throws them to the calling procedure -- if it has a chance to do so.

The simulation in my test case consists of starting to throw artificial errors (with the message "Simulated error") for each new row to be indexed after a pre-defined number of rows have been successfully indexed. Here is the error message which was returned to the calling procedure in my test case:

-------------------------

00:20:35 SQL> alter index jcx&&table_name rebuild;

old 1: alter index jcx&&table_name rebuild
new 1: alter index jcxnci_100k rebuild
alter index jcxnci_100k rebuild
*
ERROR at line 1:
ORA-29858: error occurred in the execution of ODCIINDEXALTER routine
ORA-20101: Error while processing ROWID=AAA6XzAAEAAOuiZAAl: Simulated error,
C[N+](C)(C)CCOCC[N+](C)(C)C
ORA-06512: at "PKOVACS_TRUNK.JCHEM_CORE_PKG", line 31
ORA-06512: at "PKOVACS_TRUNK.JC_IDXTYPE_IM", line 48
-------------------------

I am also attaching the log file generated by my test case. Its structure is very similar to the structure of the log file you uploaded and can be divided into the following four main sections:

1. there are a number of lines starting with

SEVERE: Error while processing

(There are 32 such lines in your log file, 4 lines in mine -- one for each indexing thread having run into the error. This number is basically equal to the number of cores available for JChem Server for indexing. It might be less (depending on a number of factors and timing differences in thread scheduling), since the controller thread starts cancelling the indexing thread at the first sign of error, so not all indexing threads may run into the problem.)

2. followed by a number of lines starting with

SEVERE: Error in worker

(This is where the controller thread is reaping the aborted indexing threads and outputs the errors which caused the indexing threads to fail. The number of the reaped threads is less then the failing thread, because at the first aborted indexing thread the controller thread started to cancel the remaining threads and started to remove them from the active set as part of the cancelling.)

3. followed by a single line for the failed roll-back starting with

SEVERE: Io exception: End of TNS data channel

in your case

SEVERE: Simulated error in rollback

(This is the main RMI request thread already taking over from the controller thread and trying to roll-back the work done before the error occurred.)

4. followed by the last SEVERE log line

which introduces the stack trace including the JCartIndexCreate.buildIndexConcurrently method) in the sequence where the error is basically re-thrown to Oracle.

If Oracle is still working at this point, the error is propagated to the client process -- as in my case. But in your case, Oracle has crashed (as you said), so the error might have well been lost under the wreckage and that might be why it never made to the client process.

Peter



 


 


User c5c63b5c6a

03-02-2010 09:20:48

Yes it was the way that they are dealt with that I was asking about. The actual database memory issues I have now resolved.


I can't view the .gz file you attached - looks as though there was an issue loading it as it's 0 bytes.


The errors seem to be thrown to the log file but not to the sqlplus session where I was re-building the index which is where I would expect them to be thrown back to, wouldn't expect it to keep trying especially if it got the error more than once - there is obviously something wrong with the database which needs further investigation before the rebuild can complete.


I see in your reply that you were able to get an error thrown back to sqlplus. Were you able to simulate the DB crash during your rebuild and then this was the error that was received?


If we could guarantee it would be thrown back to sqlplus that would be the ideal situation.

ChemAxon aa7c50abf8

03-02-2010 13:41:13

I can't view the .gz file you attached - looks as though there was an issue loading it as it's 0 bytes.

Sorry, I am making a second attempt to upload it.


The errors seem to be thrown to the log file but not to the sqlplus session where I was re-building the index which is where I would expect them to be thrown back to, wouldn't expect it to keep trying especially if it got the error more than once - there is obviously something wrong with the database which needs further investigation before the rebuild can complete.

The error messages in the log file appear sequentially for readability, but the actual errors (the lines starting with SEVERE: Error while processing) occur basically at the same time -- in 32 simultaneously running threads on your machine (over the course of 7 seconds). In my log file, you will see (if I finally manage to upload it) that all thread identifiers for these error log lines (such as [pool-1-thread-6/16]) are different. In this sense, no one keeps here trying: concurrent threads are attempting the same kind of operation and when running into an error, they abort by themselves.


In addition to the threads' "self-destruction", the thread controller is also supposed to sense the errors and cancel all threads at the first occurrence of any error, but the cancellation appears to be significantly delayed -- especially in your log file. One thing to investigate could be the reason for this delay. There is perhaps room for improvement in this respect in our code, but due to the seemingly random nature of thread/process scheduling on non-real-time operating systems, we certainly could not achieve 100% thread-cancellation rate. (Cancelling takes time, during which other still active threads have a chance to execute...)


One option we could consider is to provide a parameter to instruct JChem Server to use only one thread for inserting into the database. This would guarantee that there will only be one single attempt per INDEX REBUILD to insert into an "ailing" database. But I expect this to have a noticeable effect on performance.


One consideration which I find important is that these errors occur while JChem Server itself is acting as a regular Oracle client. Imagine that 32 concurrent clients are trying to insert one row each into your "ailing" database (with shared memory set to too low). Would you expect these insert requests to make the database crash?


I see in your reply that you were able to get an error thrown back to sqlplus. Were you able to simulate the DB crash during your rebuild and then this was the error that was received?

No, the database didn't crash during any of these test runs. I think the fundamental problem in your case was that shared memory was too low compared to what was required by the workload. Any system facing memory shortage will quickly become fragile. (Programs allocate memory, because they need it. If they don't get the memory they need, the won't work. That is the ultimate maths of this, IMHO.)


If we could guarantee it would be thrown back to sqlplus that would be the ideal situation.

Did the INDEX REBUILD command return to SQL*Plus without any indication of errors -- implying that the index had been rebuilt successfully?


Thanks


Peter

ChemAxon aa7c50abf8

22-02-2010 19:16:16

One option we could consider is to provide a parameter to instruct JChem Server to use only one thread for inserting into the database.

I have implemented the above feature which will be available through the insSessCnt=1 index parameter sometimes later this year. For more details, please see: http://www.chemaxon.com/forum/viewtopic.php?p=26697#26697 .


Regards,


Peter

ChemAxon aa7c50abf8

19-04-2010 16:46:31

JChem 5.3.2 has been released with the new option to use only one insert thread.


Peter