[osiris-devel] Re: possible fix for osirismd defunct procs
David Vasil
dmvasil at ornl.gov
Tue Jan 2 14:39:32 EST 2007
lemmings wrote:
> On Thu, Dec 21, 2006 at 11:34:05AM -0500, David Vasil wrote:
>> The original osirismd code had a signal handler for sigchld which would
>> only check if a signal occurred before and after wait_for_data() in
>> osirismd.c:process(). I added a signal handler specifically for the
>> osirismd threads that handle scans to md_scan.c. All it does is when it
>> gets a SIGCHLD it issues a wait().
>
> Your patch also does a log_info().
>
> It is generally a bad idea to do anything in a signal handler other
> than set a flag (as the current code does) or call a known-safe
> function (as documented for individual library calls; a partial list
> is in signal(2)).
>
> The reason is that many functions are not reentrant. For example,
> log_info() calls syslog(3), which will probably call malloc. If an
> existing malloc is in progress, you may corrupt the heap. The end
> result is infrequent, hard-to-diagnose crashes.
Understood. I put that log_info in there to make sure the signal
handler I added was being used instead of the signal handler in
osirismd.c. If the log_info is removed and the handler contains only a
wait(), that should be OK, since wait() is async-signal-safe.
>> Are there any brave souls who experience the zombie problem that could
>> test this patch out? If it fixes the problem I'll get Bruce to make a
>> 4.2.3 release shortly after New Years.
>
> A quick read of the source code suggests that the problem is likely to
> be the code that was commented out in check_for_signals(). A patch
> like the following (untested) would probably work better.
Thanks for looking into this and providing feedback on my original
patch. I took a clean 4.2.2 branch, applied your patch, and am
testing it in my environment now. The patched osirismd I sent out
originally had been running since the 20th of Dec. without a defunct
process.
Regarding your patch: what happens if an osirismd scan finishes and,
before check_for_signals() is able to set received_sigchld to 0,
another osirismd scan finishes? Will that second scan become a defunct
process, since received_sigchld was set to 0 before another wait() occurred?
Thanks for your help. Your patched version has been running for 5 hours
without a defunct proc; I'll keep it running for a while longer to test
more.
--
-dave