2007-07-04

AccessViolationException using Oracle and MTS

A really irritating bug has plagued my test environment for quite some time. I use Oracle and enlist the connection in MTS, and when I try to open the connection I get different errors (and sometimes no error at all):
  • Oracle.DataAccess.Client.OracleException Data provider internal error(-3000)
  • Oracle.DataAccess.Client.OracleException : ORA-12514: TNS:listener does not currently know of service requested in connect descriptor
It has also crashed my NUnit application, and the nunit-console processes that our continuous integration server launches have crashed as well.
Searching for these errors gave several pointers, but none of them applied.

Today I finally found the source of the error. I used adplus to get a crash dump of the nunit process.
adplus -crash -pn nunit.exe

adplus generated some dumps, a log and a report. When I read the log I found a new error source: an access violation was raised when Oracle tried to enlist in the transaction.

Wed Jul 4 08:18:48.229 2007 (GMT+2): (c2c.9ec): Access violation - code c0000005 (first chance)
---
--- 1st chance AccessViolation exception ----
---------------------------------------------------------------

Occurrence happened at:
Debug session time: Wed Jul 4 08:18:48.229 2007 (GMT+2)
System Uptime: 0 days 16:37:08.859
Process Uptime: 0 days 0:01:37.803
Kernel time: 0 days 0:00:02.281
User time: 0 days 0:00:05.093

Faulting stack below ---
*** ERROR: Symbol file could not be found. Defaulted to export symbols for C:\WINDOWS\system32\msvcrt.dll -
# ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
00 04cbdcdc 77bbcfdb 003f0000 00000000 000000e0 ntdll!RtlRestoreLastWin32Error+0x235
01 04cbdcf0 77bba995 000000e0 00000000 0592a2f0 msvcrt!free+0x1a8
02 04cbdd04 04d46612 000000e0 00000000 06e0b228 msvcrt!operator new+0x24
03 04cbdd38 04ccc344 059468a0 0595e340 05918db0 ORAMTS10!kpntenlistctxget+0xe6
04 00000000 00000000 00000000 00000000 00000000 OraOps10w!OpsConEnlist+0x3b4
Here we can see that ORAMTS10.dll is the source of the error. When I then searched for "access violation ORAMTS10" I found the solution in the Microsoft forum, where Sahra Parra had already debugged the same issue from another source.

The conclusion is that ORAMTS10 contains a function that enlists the connection in MTS. This function takes a parameter containing the data source name, and when we pass a data source longer than 40 characters, data on the heap is overwritten, which results in heap corruption.

Heap corruptions are hard to debug since the error doesn't show up when the data is written. It surfaces when the corrupted data is read, which explains why the error messages differ (or why no error surfaces at all).
In this case I got lucky and got an error in ORAMTS10.dll that gave me a hint to the solution. To actually debug this kind of issue you have to enable pageheap (via gflags), which makes the error surface at the moment the data is written.
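
To illustrate why the failure shows up late, here is a toy simulation in Python. It is only a sketch under loud assumptions: the "heap" is a plain bytearray, the 40-byte slot mirrors the limit mentioned above, and all names (new_heap, unchecked_copy, read_neighbour, the LISTENER value) are made up for the example. It does not reproduce what ORAMTS10 actually does internally.

```python
# Toy model of a heap: a 40-byte slot for the data source name,
# directly followed by a neighbouring allocation.
NAME_OFFSET = 0
NAME_SIZE = 40                      # assumed slot size, taken from the 40-character limit
NEIGHBOUR_OFFSET = NAME_OFFSET + NAME_SIZE
NEIGHBOUR_VALUE = b"LISTENER"       # hypothetical data owned by someone else
HEAP_SIZE = 64


def new_heap() -> bytearray:
    """Create the toy heap with the neighbouring allocation filled in."""
    heap = bytearray(HEAP_SIZE)
    end = NEIGHBOUR_OFFSET + len(NEIGHBOUR_VALUE)
    heap[NEIGHBOUR_OFFSET:end] = NEIGHBOUR_VALUE
    return heap


def unchecked_copy(heap: bytearray, offset: int, data: bytes) -> None:
    """Copy data byte by byte without checking the slot size (the bug)."""
    for i, byte in enumerate(data):
        heap[offset + i] = byte


def read_neighbour(heap: bytearray) -> bytes:
    """Read the neighbouring allocation; this is where corruption is noticed."""
    end = NEIGHBOUR_OFFSET + len(NEIGHBOUR_VALUE)
    value = bytes(heap[NEIGHBOUR_OFFSET:end])
    if value != NEIGHBOUR_VALUE:
        raise RuntimeError("neighbour data corrupted")
    return value


heap = new_heap()
unchecked_copy(heap, NAME_OFFSET, b"x" * 48)   # 8 bytes too long, but no error here
try:
    read_neighbour(heap)                        # the damage is only noticed now
except RuntimeError as error:
    print("error surfaced on read:", error)
```

The write itself succeeds silently; only the later, unrelated read fails. That is the same pattern as the varying Oracle errors described above.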


To work around the bug in ORAMTS10 I only needed to change the Data Source from the full source name (more than 40 characters)
Data Source=(DESCRIPTION=(ADDRESS_LIST=
(ADDRESS=(PROTOCOL=TCP)(HOST=oracle.internal.com)(PORT=1521)))
(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=orcl)))

to the short version (requires an entry in tnsnames.ora)
Data Source=oracle
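
If you want to catch risky connection strings before they reach the provider, a small check like the following could be run in a test or build step. It is a minimal sketch in Python, not part of ODP.NET: the helper names are made up, the "User Id=scott" part is a hypothetical addition, and the 40-character threshold is simply the limit observed above.

```python
MAX_SAFE_LENGTH = 40  # limit observed in this post, not an official Oracle figure


def data_source_of(connection_string: str) -> str:
    """Return the Data Source value, or an empty string if there is none."""
    for part in connection_string.split(";"):
        key, separator, value = part.partition("=")
        if separator and key.strip().lower() == "data source":
            return value.strip()
    return ""


def is_risky(connection_string: str) -> bool:
    """True if the Data Source is longer than the observed safe length."""
    return len(data_source_of(connection_string)) > MAX_SAFE_LENGTH


# The two connection strings from this post, plus a hypothetical user id.
full = (
    "Data Source=(DESCRIPTION=(ADDRESS_LIST="
    "(ADDRESS=(PROTOCOL=TCP)(HOST=oracle.internal.com)(PORT=1521)))"
    "(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=orcl)));User Id=scott"
)
short = "Data Source=oracle;User Id=scott"

for name, value in (("full", full), ("short", short)):
    print(name, "risky" if is_risky(value) else "ok")
```

Splitting on ";" is enough here because the connect descriptor itself contains no semicolons; a connection string with quoted values would need a real parser.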

With that change the crashes went away and I'm so happy ;)

1 comment:

  1. Thank you so much for sharing your solution. It was very technical and logical from a software engineer's perspective. Unlike many other recommended fixes that have failed countless times, this one fixes the problem you describe. Maybe Oracle should hire you or at least give you a bonus for solving their bug. :) Thanks again.
