Tuesday, October 11, 2011

ORA-00030 raised on kill session - troubleshoot with Oracle pstack

It all started with a DDL I had to perform - a Truncate.
This failed as it follows:
truncate table TRANSACTION_FLAGS_IOT;

truncate table TRANSACTION_FLAGS_IOT
Error report:
SQL Error: ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired
00054. 00000 -  "resource busy and acquire with NOWAIT specified"
*Cause:    Resource interested is busy.
*Action:   Retry if necessary.

The error tells us that I still have an active DML on the resource, which have found being true, using the below query:

select session_id,owner,name,mode_held,blocking_others;
from dba_dml_locks
where name like 'TRANSACTION_FLAGS_IOT'
151 SMART TRANSACTION_FLAGS_IOT Row-X (SX) Not Blocking

Now, I want to see some more details on this session:

select
sid, serial#,paddr, status,type, last_call_et,blocking_Session,event,wait_class,state, command
from v$session
where sid=151

151 2439 000000015F4DBEA0 ACTIVE USER 537179  db file sequential read User I/O WAITING 0

Since this session is in Waiting state for so much time, I confirmed with users this can be safely killed:

alter system kill session '151,2439';

alter system kill session '151,2439'
Error report:
SQL Error: ORA-00030: User session ID does not exist.
00030. 00000 -  "User session ID does not exist."
*Cause:    The user session ID no longer exists, probably because the
           session was logged out.
*Action:   Use a valid session ID.

This normally happens when you want to kill a session that has a (large) transaction to rollback.
We cannot kill it since it has to rollback, so we expect that the serial# in v$session for the process will keep changing, but in my case, this change of serial# does not happen.
Such cleanup of a rollback is on a low priority, done by SMON background process.
I kept checking and checking, my transaction was not doing a rollback in the shadow, I suspect it was somehow blocked.
Now I don't want to kill the OS process associate with my session, I want to understand where is stuck.
So I identify the OS process as being 1783:

select addr,pid,spid,username,pga_used_mem from v$process where addr = '000000015F4DBEA0'
000000015F4DBEA0 22 1783 oracle 2343004

I verify the existance of the process on the DB server:

[oracle@performance ~]$ ps -flp 1783
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 D oracle    1783     1  0  77   0 - 1108041 sync_p Aug10 ?      00:54:24 oracletoptech (LOCAL=NO)

Now, in order to track the operations done in the background by my process, I take a pstack (process stack - current state of server process 1783):

[oracle@performance ~]$ pstack 1783
#0  0x000000364c60e3f3 in __pread_nocancel () from /lib64/libpthread.so.0
#1  0x00000000085b9ee1 in skgfqio ()
#2  0x000000000849c745 in ksfd_skgfqio ()
#3  0x000000000849ba4b in ksfd_io ()
#4  0x000000000849b104 in ksfdread ()
#5  0x0000000001b28a4d in kcfrbd1 ()
#6  0x0000000000d1d9c8 in kcbzib ()
#7  0x00000000082e5a04 in kcbgcur ()
#8  0x00000000082adb53 in ktbgcur ()
#9  0x00000000015353d5 in kdiulk ()
#10 0x0000000008476bf4 in kcoubk ()
#11 0x0000000008291fce in ktundo ()
#12 0x0000000000c53a5f in kturCurrBackoutOneChg ()
#13 0x0000000000c52037 in ktuabt ()
#14 0x0000000000c4c516 in ktcAbortTxn ()
#15 0x0000000000c46a23 in ktcsod ()
#16 0x000000000822c757 in kssdel ()
#17 0x000000000822da91 in kssdch ()
#18 0x0000000000b5dc5b in ksuxds ()
#19 0x0000000000b5d1b7 in ksudel ()
#20 0x00000000014394e1 in opidcl ()
#21 0x0000000001437bb1 in opidrv ()
#22 0x00000000018aac5b in sou2o ()
#23 0x00000000009d3eb5 in opimai_real ()
#24 0x00000000018afeea in ssthrdmain ()
#25 0x00000000009d3e21 in main ()

This shows that my process is doing some work, and have to identify where is waiting.
The mnemonics shown are described in ORA-600 Lookup Error Categories [ID 175982.1], for example, reading it buttom-up
    - ktc - transaction control operations at the block level
    - ktcAbortTxn - function to abort transaction
    - ktu - internal management of undo and rollback segments
The last (from buttom-up) command is __pread_nocancel () from /lib64/libpthread.so.0
The pread system call does an atomic position-and-read, in our case of /lib64/libpthread.so.0.
So I am having a blocking read, but why on /lib64/libpthread.so.0? Do I have this file, why my Oracle account is not able to read it?

[oracle@performance ~]$ cd /lib64
[oracle@performance lib64]$ ls -lrt libpthread.so.0
lrwxrwxrwx 1 root root 17 Nov 23  2010 libpthread.so.0 -> libpthread-2.5.so

The file is there, is defined as a symbolic link, but am not able to read it.
Now I will look in my environment variable LD_LIBRARY_PATH and will find the answer, the 64 bit library in not defined in my environment:

[oracle@performance ~]$ echo $LD_LIBRARY_PATH
/u01/app/oracle/product/11.1.0/db_1/lib:/lib:/usr/lib

So the /lib64/libpthread.so.0 could have never been able to be read with my current definition of the $LD_LIBRARY_PATH.
I had to do a classic kill -9 1783 in order to kill the process and continue to innitiate the Truncate command.
I then changed the variable definition.

References
ORA-600 Lookup Error Categories [ID 175982.1]
ORA-04031 "KSFQ Buffers" ksmlgpalloc [ID 453521.1]
http://blog.tanelpoder.com/2007/09/06/advanced-oracle-troubleshooting-guide-part-3-more-adventures-in-process-stack/

1 comment: