From osiris-devel at lemmin.gs  Mon Jan  1 08:34:28 2007
From: osiris-devel at lemmin.gs (lemmings)
Date: Tue, 2 Jan 2007 00:34:28 +1100
Subject: [osiris-devel] Re: possible fix for osirismd defunct procs
In-Reply-To: <458AB77D.90005@ornl.gov>
References: <458AB77D.90005@ornl.gov>
Message-ID: <20070101133428.GA9020@lemmin.gs>

On Thu, Dec 21, 2006 at 11:34:05AM -0500, David Vasil wrote:
> The original osirismd code had a signal handler for sigchld which would
> only check if a signal occurred before and after wait_for_data() in
> osirismd.c:process(). I added a signal handler specifically for the
> osirismd threads that handle scans to md_scan.c. All it does is when it
> gets a SIGCHLD it issues a wait().

Your patch also does a log_info().

It is generally a bad idea to do anything in a signal handler other than
set a flag (as the current code does) or call a known safe function (as
documented for individual library calls; a partial list is in signal(2)).

The reason is that many functions are not reentrant. For example,
log_info() calls syslog(3), which will probably call malloc. If an
existing malloc is in progress then you may get corruption of the heap.
The end result is infrequent, hard to diagnose crashes.

> Are there any brave souls who experience the zombie problem that could
> test this patch out? If it fixes the problem I'll get Bruce to make a
> 4.2.3 release shortly after New Years.

A quick read of the source code suggests that the problem is likely to
be the code that was commented out in check_for_signals(). A patch
like the following (untested) would probably work better.

e

===================================================================
--- osirismd.c (revision 69)
+++ osirismd.c (working copy)
@@ -259,7 +259,6 @@
     {
         client_length = sizeof( client );
 
-        check_for_signals();
         wait_for_data();
 
         /* we have data waiting, see if it is to the server. if it * */
@@ -1453,6 +1452,8 @@
 wait_start:
 
+    check_for_signals();
+
     FD_ZERO( &read_set );
 
     /* add the control socket to the set. */
@@ -1484,7 +1485,6 @@
         if( ( sresult < 0 ) && ( errno == EINTR ) )
         {
-            check_for_signals();
             goto wait_start;
         }
     }
@@ -1792,37 +1792,34 @@
         child_pid = wait(&status);
 
-/*
-        while( ( pid = waitpid( -1, &status, WNOHANG ) ) > 0 ||
-               ( ( pid < 0 ) && ( errno == EINTR ) ) )
-            ;
-*/
+        while ( ( child_pid = waitpid( -1, &status, WNOHANG ) ) > 0 )
+        {
+            if ( child_pid == scheduler_pid )
+            {
+                log_error( LOG_ID_DAEMON_INFO, NULL,
+                           "SIGCHLD, scheduler has terminated!" );
 
-        if( child_pid == scheduler_pid )
-        {
-            log_error( LOG_ID_DAEMON_INFO, NULL,
-                       "SIGCHLD, scheduler has terminated!" );
+                halt(EXIT_CODE_ERROR);
+            }
 
-            halt(EXIT_CODE_ERROR);
-        }
+            else if ( child_pid == log_pid )
+            {
+                log_error( LOG_ID_DAEMON_INFO, NULL,
+                           "SIGCHLD, log application was terminated!" );
+            }
 
-        else if ( child_pid == log_pid )
-        {
-            log_error( LOG_ID_DAEMON_INFO, NULL,
-                       "SIGCHLD, log application was terminated!" );
-        }
+            else if ( getpid() == scheduler_pid )
+            {
+                log_info( LOG_ID_SCHEDULER_INFO, NULL,
+                          "SIGCHLD, scheduler child process has terminated." );
+            }
 
-        else if ( getpid() == scheduler_pid )
-        {
-            log_info( LOG_ID_SCHEDULER_INFO, NULL,
-                      "SIGCHLD, scheduler child process has terminated." );
-        }
-
-        else
-        {
-            log_info( LOG_ID_DAEMON_INFO, NULL,
-                      "SIGCHLD, client process has terminated." );
-        }
+            else
+            {
+                log_info( LOG_ID_DAEMON_INFO, NULL,
+                          "SIGCHLD, client process has terminated." );
+            }
+        }
     }
 
     if( received_sigpipe )

From abiacco at decentrix.net  Tue Jan  2 12:20:34 2007
From: abiacco at decentrix.net (Anthony J Biacco)
Date: Tue, 2 Jan 2007 10:20:34 -0700
Subject: [osiris-devel] osirismd crashing on windows server 2003
Message-ID: <4BE79F875F2BBF47870810A2404CC9D024ACCE@iron.DecentrixInc.local>

I'm running Osiris 4.2.2 on the server and on all clients.
Server (Osiris MD) is Windows Server 2003 (w/SP1).
Every so often, seemingly at random intervals, the Osiris MD service
crashes. Last night it did it twice in an hour, but hadn't at all for
about 4 days.

I'll usually get this in the Windows event log:

Faulting application osirismd.exe, version 0.0.0.0, faulting module
osirismd.exe, version 0.0.0.0, fault address 0x0000d73f.
Reporting queued error: faulting application osirismd.exe, version
0.0.0.0, faulting module osirismd.exe, version 0.0.0.0, fault address
0x0000d73f.

Same fault address every time.

Anyone seen this and/or know a fix for it?

Thanx,

-Tony
------------------------------------
Anthony J. Biacco
Senior Systems/Network Administrator
Decentrix Inc.
303-899-4000 x303

From dmvasil at ornl.gov  Tue Jan  2 14:39:32 2007
From: dmvasil at ornl.gov (David Vasil)
Date: Tue, 02 Jan 2007 14:39:32 -0500
Subject: [osiris-devel] Re: possible fix for osirismd defunct procs
In-Reply-To: <20070101133428.GA9020@lemmin.gs>
References: <458AB77D.90005@ornl.gov> <20070101133428.GA9020@lemmin.gs>
Message-ID: <459AB4F4.8060108@ornl.gov>

lemmings wrote:
> On Thu, Dec 21, 2006 at 11:34:05AM -0500, David Vasil wrote:
>> The original osirismd code had a signal handler for sigchld which would
>> only check if a signal occurred before and after wait_for_data() in
>> osirismd.c:process(). I added a signal handler specifically for the
>> osirismd threads that handle scans to md_scan.c. All it does is when it
>> gets a SIGCHLD it issues a wait().
>
> Your patch also does a log_info().
>
> It is generally a bad idea to do anything other than set a flag (as
> the current code does) or call a known safe function (as documented
> in individual library calls. A partial list is in signal(2)) in a
> signal handler.
>
> The reason is that many functions are not reentrant (For example,
> log_info() calls syslog(3) which will probably call malloc. If an
> existing malloc is in progress then you may get corruption of the
> heap). The end result is infrequent, hard to diagnose crashes.

Understandable. I put that log_info in there to make sure the signal
handler I added was being used instead of the signal handler in
osirismd.c. If the log_info was removed and all that was in there was a
wait(), that should be OK since it is a reentrant function.

>> Are there any brave souls who experience the zombie problem that could
>> test this patch out? If it fixes the problem I'll get Bruce to make a
>> 4.2.3 release shortly after New Years.
>
> A quick read of the source code suggests that the problem is likely to
> be the code that was commented out in check_for_signals(). A patch
> like the following (untested) would probably work better.

Thanks for looking into this and providing feedback for my original
patch. I took a clean 4.2.2 branch and applied your patch and am
testing it in my environment now. The patched osirismd I sent out
originally ran since the 20th of Dec. without a defunct process.

Regarding your patch, what happens if an osirismd scan finishes and,
before it is able to set received_sigchld to 0 in check_for_signals(),
another osirismd scan finishes? Will that second scan become a defunct
process since received_sigchld was set to 0 before another wait()
occurred?

Thanks for your help. Your patched version has been running for 5 hours
without a defunct proc; I'll keep it running for a while longer to test
more.
--
-dave

From osiris-devel at lemmin.gs  Tue Jan  2 17:41:11 2007
From: osiris-devel at lemmin.gs (lemmings)
Date: Wed, 3 Jan 2007 09:41:11 +1100
Subject: [osiris-devel] Re: possible fix for osirismd defunct procs
In-Reply-To: <459AB4F4.8060108@ornl.gov>
References: <458AB77D.90005@ornl.gov> <20070101133428.GA9020@lemmin.gs>
	<459AB4F4.8060108@ornl.gov>
Message-ID: <20070102224111.GA6612@lemmin.gs>

On Tue, Jan 02, 2007 at 02:39:32PM -0500, David Vasil wrote:
>
> Regarding your patch, what happens if an osirismd scan finishes and
> before it is able to set received_sigchld to 0 in check_for_signals()
> another osirismd scan finishes. Will that second scan become a defunct
> process since received_sigchld was set to 0 before another wait() occurred?

The race condition won't cause a defunct process as there is a loop
around waitpid() _after_ received_sigchld is set to 0.

e

From dmvasil at ornl.gov  Mon Jan  8 11:09:56 2007
From: dmvasil at ornl.gov (David Vasil)
Date: Mon, 08 Jan 2007 11:09:56 -0500
Subject: [osiris-devel] Re: possible fix for osirismd defunct procs
In-Reply-To: <20070102224111.GA6612@lemmin.gs>
References: <458AB77D.90005@ornl.gov> <20070101133428.GA9020@lemmin.gs>
	<459AB4F4.8060108@ornl.gov> <20070102224111.GA6612@lemmin.gs>
Message-ID: <45A26CD4.9090007@ornl.gov>

lemmings wrote:
> On Tue, Jan 02, 2007 at 02:39:32PM -0500, David Vasil wrote:
>> Regarding your patch, what happens if an osirismd scan finishes and
>> before it is able to set received_sigchld to 0 in check_for_signals()
>> another osirismd scan finishes. Will that second scan become a defunct
>> process since received_sigchld was set to 0 before another wait() occurred?
>
> The race condition won't cause a defunct process as there is a loop
> around waitpid() _after_ received_sigchld is set to 0.
>
> e

I've been running your patched version of the defunct osirismd fix for a
week now and defunct processes are still created, but they are cleaned
up after the next sigchld is received by the osirismd scheduling
process. Is this expected behavior?

It is better than the current 4.2.2 release since the defunct processes
are cleaned up. It just seems that it would be ideal if the defunct
procs were never created. Let me know your thoughts; I'm going to test
the build on BSD, Fedora, and WinNT platforms.

--
-dave

From osiris-devel at lemmin.gs  Tue Jan  9 00:48:20 2007
From: osiris-devel at lemmin.gs (lemmings)
Date: Tue, 9 Jan 2007 16:48:20 +1100
Subject: [osiris-devel] Re: possible fix for osirismd defunct procs
In-Reply-To: <45A26CD4.9090007@ornl.gov>
References: <458AB77D.90005@ornl.gov> <20070101133428.GA9020@lemmin.gs>
	<459AB4F4.8060108@ornl.gov> <20070102224111.GA6612@lemmin.gs>
	<45A26CD4.9090007@ornl.gov>
Message-ID: <20070109054820.GD6366@lemmin.gs>

On Mon, Jan 08, 2007 at 11:09:56AM -0500, David Vasil wrote:
>
> I've been running your patched version of the defunct osirismd fix for a
> week now and defunct processes are still created, but they are cleaned
> up after the next sigchld is received by the osirismd scheduling
> process. Is this expected behavior?

Yes. Because the check for children that need reaping is periodic,
zombies will inevitably accumulate between checks; however, they will
all get reaped in time, as you have seen. The number outstanding at any
one time shouldn't be very high and shouldn't cause problems.

> It is better than the current 4.2.2 release since the defunct processes
> are cleaned up. It just seems that it would be ideal if the defunct
> procs were never created. Let me know your thoughts; I'm going to test
> the build on BSD/Fedora/and WinNT platforms.

I/O and signals are not fun to mix. Rapid reaping of children requires
one of the following:

1) Reap child in signal handler: No (interferes with detailed log
   diagnostics).

2) Have event handling core of osirismd use non blocking I/O and I/O
   multiplexing. Indicate child that needs to be reaped by writing a
   byte to pipe in signal handler.

3) Have event handling core of osirismd use signal driven I/O and
   appropriate use of sigsuspend/sigprocmask/etc.

Options 2 & 3 would require a significant change to osirismd, which
leads to the current approximate (but simple) solution of using an
atomic flag and periodically polling the status of the flag before and
after I/O. The approximate solution could be improved with additional
timeouts/checks if warranted.

e

From dmvasil at ornl.gov  Tue Jan  9 07:57:51 2007
From: dmvasil at ornl.gov (David Vasil)
Date: Tue, 09 Jan 2007 07:57:51 -0500
Subject: [osiris-devel] Re: osirismd crashing on windows server 2003
In-Reply-To: <4BE79F875F2BBF47870810A2404CC9D024ACCE@iron.DecentrixInc.local>
References: <4BE79F875F2BBF47870810A2404CC9D024ACCE@iron.DecentrixInc.local>
Message-ID: <45A3914F.2040808@ornl.gov>

Anthony J Biacco wrote:
> I'm running Osiris 4.2.2 on the server and on all clients. Server
> (Osiris MD) is Windows Server 2003 (w/SP1).
> Every so often, seemingly at random intervals, the Osiris MD service
> crashes. Last night it did it twice in an hour, but hadn't at all for
> about 4 days.
>
> I'll usually get this in the Windows event log:
>
> Faulting application osirismd.exe, version 0.0.0.0, faulting module
> osirismd.exe, version 0.0.0.0, fault address 0x0000d73f.
> Reporting queued error: faulting application osirismd.exe, version
> 0.0.0.0, faulting module osirismd.exe, version 0.0.0.0, fault address
> 0x0000d73f.
>
> Same fault address every time.
>
> Anyone seen this and/or know a fix for it?

I haven't tried using the OsirisMD on anything except Win XP Pro, and
admittedly that was only to make sure it would compile.
Since the error messages show it faulting at the same address each time,
it may be helpful to try this HowTo on debugging Windows services to get
more information:

http://support.microsoft.com/kb/824344

--
-dave

From abiacco at decentrix.net  Tue Jan  9 12:03:36 2007
From: abiacco at decentrix.net (Anthony J Biacco)
Date: Tue, 9 Jan 2007 10:03:36 -0700
Subject: [osiris-devel] Re: osirismd crashing on windows server 2003
In-Reply-To: <45A3914F.2040808@ornl.gov>
References: <4BE79F875F2BBF47870810A2404CC9D024ACCE@iron.DecentrixInc.local>
	<45A3914F.2040808@ornl.gov>
Message-ID: <4BE79F875F2BBF47870810A2404CC9D024AE5C@iron.DecentrixInc.local>

Thanx, I'll try that and see what it spits out.

-Tony
------------------------------------
Anthony J. Biacco
Senior Systems/Network Administrator
Decentrix Inc.
303-899-4000 x303

From dmvasil at ornl.gov  Wed Jan 10 16:02:26 2007
From: dmvasil at ornl.gov (David Vasil)
Date: Wed, 10 Jan 2007 16:02:26 -0500
Subject: [osiris-devel] Re: fixing the filters implementation
In-Reply-To: <456DE642.6040104@ornl.gov>
References: <455DB887.8060501@ornl.gov> <455F90A4.7020509@fidoki.com>
	<4561B000.1050004@ornl.gov> <45637249.1070505@ornl.gov>
	<456DE642.6040104@ornl.gov>
Message-ID: <45A55462.4070405@ornl.gov>

David Vasil wrote:
> David Vasil wrote:
>> I have this written against the current 4.2.2 release (the md_compare.c
>> patch is also added into this patch). I have not applied this to the
>> current svn trunk yet as I would like to have it tested a little before
>> committing it. I tested this on Linux (Ubuntu, RHEL, Fedora), OpenBSD
>> (4.0), and Windows (the code works as expected, but filters in general
>> don't appear to work under Windows [have they ever? 4.2.2 without the
>> patches behave the same way]).
>>
>> I added in a function to md_filter.c to create the filter file if it
>> does not exist already by dumping the current filter database to the
>> file. This will prevent people's filters from being lost after
>> upgrading to this code.
>>
>> I'll be running this code on my system for a while to see how it works
>> out. If others can run the code or look over it for obvious mistakes,
>> please let me know.
>
> I've been testing this on my system without a problem for about a week
> now. If no one has any objections, I'll commit this to the subversion
> trunk and it will be part of the next release.

I have committed this to the 4.2.3 build and have made a release package
for 4.2.3 that I'll get Bruce to post soon. I assume from the lack of
comments on the filters change that everyone is OK with it (has anyone
tried it out?).

Here's a list of things from the ChangeLog that will be put out for this
release:

Differences with version 4.2.3
=================================================
FIXES:

 : Windows uninstaller now removes all osiris related registry keys
   during uninstall.
 : Linux mod_ports will only attempt to process the tcp procfiles if
   they exist.
 : Fixed a bug in the osirismd where the scan context was closed too
   early in the compare routine.
 : Fixed a bug in the CLI where print-db would see a race condition and
   fail.
 : Fixed the console and agent creation scripts to build the OpenBSD
   packages correctly.
 : Lemmings provided a fix to clean up defunct processes being created
   by the osirismd scheduling process.

FEATURES:

 : Filters are now stored in a flat text file on the management daemon.
   Existing filters will be copied from the filter database into the
   flat text file if the flat file does not exist at the time of
   osirismd starting. This allows the filters to keep their order and
   makes comments in the filters much more useful.

--
-dave