13 Replies Last post: Aug 6, 2009 11:54 PM by joegen  
  8 posts since
Jul 28, 2009

Jul 28, 2009 11:18 AM

Debugging OpenSBC


Hi,


I am running opensbc 1.1.5-108 and am experiencing crashes on a very frequent basis. Can you give me some help in debugging this. I am presently not getting very much from the core files its generating.

for instance, in the latest crash, all i get is : -


[root@rcl-dr-sb01 opensbc]# gdb ./opensbc115 core.6121
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...

warning: Can't read pathname for load map: Input/output error.
Cannot access memory at address 0xb7f37000
(gdb) bt
#0 0x00e9f402 in ?? ()
Cannot access memory at address 0xb67ebc1c
(gdb)


Not sure thats much help, given other posts Ive read.


My setup is UA > OSBC > PROVIDER.

There are up to 60 call center agents using the phones, however Ive seen the crash happen with a very light load.


Cheers


Stuart


Click to view joegen's profile   539 posts since
Apr 28, 2007
2. Jul 28, 2009 8:17 PM in response to: darkone338
Re: Debugging OpenSBC
You are hiting a file handle limit. check 'ulimit -n' and set it to maximum value (65536)
Click to view joegen's profile   539 posts since
Apr 28, 2007
5. Jul 29, 2009 7:53 PM in response to: darkone338
Re: Debugging OpenSBC
Would you be able to provide a packet capture of the packet that produced the Invalid Index Error? I've checked SIPHeaderC::GetEntry() and it is called from within Contact::GetURI(). I am suspecting that this is the 302 response from 172.20.4.51
Click to view joegen's profile   539 posts since
Apr 28, 2007
7. Jul 30, 2009 8:03 PM in response to: darkone338
Re: Debugging OpenSBC

Looking at the backtrace as well as the log, the crash happens in writing the very first RTP packet. A crash h ere is highly unlikely unless the socket is in an undefined state. How many connections are active before this happens? Try using SIPP and in a scenario that gradually increases the number of concurrent calls and see if you could reproduce this crash at a constant number of concurrent connections.

I believe you that you are setting ulimit -n to a proper value but i've seen systems that silently accepts this but really does not change the hard limit. after setting ulimit -n 65534, did you issue "ulimit -n" again to check if the change reflected?

Click to view joegen's profile   539 posts since
Apr 28, 2007
9. Aug 4, 2009 9:46 PM in response to: darkone338
Re: Debugging OpenSBC
inline ...

{quote:title=darkone338 wrote:}{quote}
ive checked and double checked the ulimit setting and confirmed
65534. Since the crash of thursday last week, all was well until this
afternoon. Under fairly moderate load (less than 20 calls) opensbc died
with the following error in the log:
Ok

2009/08/04 17:20:30.128 INF: [CID=0x29830f01] >>> SIP/2.0
180 Ringing Method(INVITE) DST: 172.20.4.57:45503:UDP
SRC=172.20.5.252:5061 enc=0 bytes=570
2009/08/04 17:20:30.134 PWL: [CID=0x00000000] RTP_UDP
SetRemoteSocketInfo: session=1 data channel, new=212.11.91.123:14164,
local=<sbc external ip removed>:43314-43315, remote=0.0.0.0:0-0
2009/08/04 17:20:30.134 PWL: [CID=0x00000000] RTP_UDP
SetRemoteSocketInfo: session=1 control channel,
new=212.11.91.123:14165, local=<sbc external ip removed>:43314-43315,
remote=212.11.91.123:14164-14165
2009/08/04 17:20:30.146 INF: [CID=0x29830f01] >>> SIP/2.0
183 Session Progress Method(INVITE) DST: 172.20.4.57:45503:UDP
SRC=172.20.5.252:5061 enc=0 bytes=921
2009/08/04 17:20:30.384 PWL: [CID=0x00000000] RTP Set remote address from first Data PDU from 172.20.4.57:18056 enc=0
2009/08/04 17:20:30.384 PWL: [CID=0x00000000] RTP_UDP
SetRemoteSocketInfo: session=1 data channel, new=172.20.4.57:18056,
local=172.20.5.252:43312-43313, remote=0.0.0.0:0-1
2009/08/04 17:20:30.421 PWL: [CID=0x00000000] PWLib Assertion
fail: Function pthread_mutex_unlock failed, file tlibthrd.cxx, line 1929


Corefile:

(gdb) bt
#0 0x00358402 in __kernel_vsyscall ()
#1 0x00d2e4cb in read () from /lib/libpthread.so.0
#2 0x082ac81e in PX_SuspendSignalHandler ()
#3 0x082ad1d2 in PThread::Suspend ()
#4 0x0811a323 in Tools::OSSApplication::Main ()
#5 0x08268e42 in PServiceProcess::_main ()
#6 0x0804e3c1 in main ()
(gdb)

I know this is somehow related to threading.


The core file is not accurate. However the log is very helpful. After checking the latest patches to rtp.cxx, I can see that SetRemoteSocketInfo MAY be called from different threads without being guarded by a mutex. So it makes sence now that a simple call to RTP_UDP::WriteData() can cause crash. I have introduced a mutex around SetRemoteSocketInfo as well as WriteData/WriteControl to ensure the integrity of the critical section. Please try the latest in CVS and let me know if you experience the same crash again.

I
took a look at the resources SBC is using on my system and top reports
the opensbc process using <20% of the CPU time on a dual CPU P3
1.4Ghz HP DL360. (ps -ef reports cpu usage as a solid 25% however ?!)

I don't quite get what the question here. Can you further clarify?
snip

009/08/04 21:33:42.978 PWL: [CID=0x00000000] Parser TransactionId::operator=() - No Call-ID present in message
2009/08/04 21:33:42.979 PWL: [CID=0x00000000] *** MATCH *** sip:172.20.4.38 --> <sip:172.20.*> sip:172.20.5.252
2009/08/04 21:33:42.979 PWL: [CID=0x00000000] Using Iface: 172.20.5.252 to send to Dest: 172.20.4.38

I get those same 3 lines for each IP phone connected through the sbc. (Cisco 7940)

On the phone itself I see the following messages that I think are related:

[21:37:58:44755937] SIPProcessUDPMessage: recv UDP message from <172.20.5.252>:<50195>, length=<4>, message=
[21:37:58:44755938]

[21:37:58:44755938]
MSG LINE READ FAILURE DUE TO RS->EOF
[21:37:58:44755939] sipSPISendErrorResponse: Error: Did not find valid CSeq header. Cannot send response.
[21:37:58:44755940] sipSPISendErrorResponse: Error: Build flag is unsuccessful. Will not send message.
[21:37:58:44755940] SIPTaskProcessSIPMessage: Error: sipSPISendErrorResponse(400) failed.
[21:37:58:44755941] SIPTaskProcessSIPMessage: Error: sippmh_is_message_complete() returned error.
These are keep-alives packet sent by OSBC to ensure that the NAT binding keeps open. The 4 bytes is a series of CRLF/CRLF and can be safely ignored. So far, I yet have to see a UA that pukes when it receives two blank lines.
Click to view joegen's profile   539 posts since
Apr 28, 2007
12. Aug 6, 2009 11:26 PM in response to: darkone338
Re: Debugging OpenSBC
Right. I've mutexed ReadData as well. I forgot that read and write operations are separate thread for each leg. Can you recompile using the latest library?
Click to view joegen's profile   539 posts since
Apr 28, 2007
13. Aug 6, 2009 11:54 PM in response to: joegen
Re: Debugging OpenSBC
Also, just to be sure we are on the same page. The changes are in OpenSIPStack so make sure you update/rebuild opensipstack together with opensbc