More information (was Re: Event loop stall bug in hostapd-0.7.3)
Bryan.Phillippe at watchguard.com
Thu Jan 12 13:40:47 EST 2012
I learned some things about this while debugging it with a non-optimized build of hostapd. I believe the apparently corrupted private data structure was an artifact of inspecting an optimized build in the debugger. That would explain why the sendmsg() is not immediately returning with an EBADF on the monitor_sock.
I think what's actually happening is that the sendmsg() on the monitor_sock is indeed blocking. I guess that could be a problem in the nl80211 driver in the kernel instead of something wrong with hostapd? I'm going to start investigating that side of it now. If you have any advice on that, please let me know.
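One way to tell a blocking send apart from a failing one is to issue the send with MSG_DONTWAIT and look at errno. The following is only a minimal sketch under stated assumptions: it uses an AF_UNIX socketpair as a stand-in for the monitor socket (the real monitor_sock is a packet socket and may behave differently), and the helper names send_frame_nonblock() and demo_would_block() are hypothetical, not hostapd functions.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of hostapd): try to send one buffer
 * without ever blocking the caller.
 * Returns 1 if data was queued, 0 if the send would have blocked,
 * and -1 on any other error (errno is left set by sendmsg()).
 */
static int send_frame_nonblock(int fd, const void *buf, size_t len)
{
    struct iovec iov;
    struct msghdr msg;
    ssize_t res;

    iov.iov_base = (void *) buf;
    iov.iov_len = len;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    /* MSG_DONTWAIT makes this one call non-blocking. */
    res = sendmsg(fd, &msg, MSG_DONTWAIT);
    if (res < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;
    return res < 0 ? -1 : 1;
}

/*
 * Stand-in demonstration: fill a socketpair that nobody reads from
 * until the kernel buffer is full. A plain sendmsg() would block at
 * that point; the helper reports it instead.
 * Returns 1 if the would-block condition was observed, -1 otherwise.
 */
static int demo_would_block(void)
{
    int sv[2];
    char frame[4096];
    int i, res, result = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    memset(frame, 0, sizeof(frame));
    for (i = 0; i < 100000; i++) {
        res = send_frame_nonblock(sv[0], frame, sizeof(frame));
        if (res <= 0) {
            result = (res == 0) ? 1 : -1;
            break;
        }
    }

    close(sv[0]);
    close(sv[1]);
    return result;
}
```

If a call like this reported "would block" on the real socket, that would point at the transmit path not draining rather than at a bad file descriptor. Whether a non-blocking send is an acceptable change for the driver code is a separate question.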
On Jan 11, 2012, at 10:59 PM, wrote:
> On Jan 11, 2012, at 9:51 PM, Jouni Malinen wrote:
>> On Mon, Jan 09, 2012 at 10:59:17PM +0000, Bryan Phillippe wrote:
>>> Well, I was able to debug this problem more during a repro today. I found a lot of information. Basically, we're stuck in wpa_driver_nl80211_send_frame() from src/drivers/driver_nl80211.c here:
>> How easily can you reproduce this? What platform (CPU, etc.) do you
>> have on the AP? Would you be able to run hostapd under valgrind by any
> I've been working on a repro. So far, it seems to be more reproducible when a lot of clients are scanning for APs. That also results in many more event loop iterations, which is probably why it increases the reproducibility. This is consistent with the problem only hitting our most heavily used APs. My personal AP at home, with about 5 stations on it, has never had this problem. But the one in our sales conference room seems to suffer from it a few times a week, or even a few times a day if all the people are in there using it...
> I've seen it on two types of APs, one is an IXP ARM-based platform and the other is PPC (Freescale).
> I'll have to get back to you on valgrind; it's a great tool and I've used it before, but it will be some effort for me to get it working here.
>>> The sendmsg() is blocked on the monitor_sock, which is apparently blocking IO and unable to send for some reason.
>> I don't think that this is the real issue - the real issue is that
>> something got corrupted before the call:
> Yes, I agree - since my post, I came to this conclusion by comparing the structure under normal operation to what it looks like when the problem occurs. It actually appears that everything after the brname is wrong - that ifindex is very wrong as well, for example. Also, the if_removed, capa, and other fields are all bogus. So it almost seems like the brname copy is overrunning the remainder of the structure.
> Another very important bit of information: the brname is and should be "ethN", not "ath1". I've set up a bridge already called ethN that has ath1 and some other Ethernet interfaces added to it, and during normal operation, the brname is always "ethN". However, something is causing it to be set to "ath1" during runtime. I've looked over the functions that set the brname and I'm trying to figure out if that is being called during runtime, or if the structure is being stomped on some other way.
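To illustrate the overrun hypothesis described above (a name copy spilling into the fields that follow it), here is a minimal sketch of a bounded copy that cannot write past a fixed-size name buffer. Everything in it is an assumption for illustration: struct demo_drv_data is a hypothetical stand-in, not the real nl80211 driver data layout, BRNAME_LEN is an assumed buffer size, and demo_set_brname() is not a hostapd function.

```c
#include <stdio.h>
#include <string.h>

#define BRNAME_LEN 16 /* assumed buffer size, for illustration only */

/* Hypothetical stand-in for driver private data; NOT the real layout. */
struct demo_drv_data {
    char brname[BRNAME_LEN];
    int ifindex;
    int if_removed;
};

/*
 * Copy name into drv->brname without writing past the array.
 * snprintf() always NUL-terminates and reports the length it wanted,
 * so truncation can be detected.
 * Returns 0 on success, -1 if the name did not fit and was truncated.
 */
static int demo_set_brname(struct demo_drv_data *drv, const char *name)
{
    int n = snprintf(drv->brname, sizeof(drv->brname), "%s", name);

    if (n < 0 || (size_t) n >= sizeof(drv->brname))
        return -1;
    return 0;
}

/*
 * Self-check: an over-long name must be truncated, and the fields
 * placed after brname must keep their values.
 * Returns 1 if all of that holds, 0 otherwise.
 */
static int demo_overrun_check(void)
{
    struct demo_drv_data drv;
    int truncated;

    memset(&drv, 0, sizeof(drv));
    drv.ifindex = 7;
    drv.if_removed = 0;

    truncated = demo_set_brname(&drv, "a-bridge-name-longer-than-16-bytes");

    return truncated == -1 &&
           drv.ifindex == 7 &&
           drv.if_removed == 0 &&
           strlen(drv.brname) == BRNAME_LEN - 1;
}
```

A short name such as "ath1" would fit in a buffer of this size, so a copy like this would not explain the bogus ifindex by itself; the sketch only shows how a bounded copy rules out one class of overrun.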
>> I don't remember seeing this type of issue. Would you be able to test
>> the current development snapshot from hostap.git master branch? It would
>> be interesting to see whether this could have already been addressed.
>> valgrind could also be able to pinpoint the actual place where the
>> structure gets corrupted.
> I downloaded the latest snapshot, but it will take me some time to set up my cross-compiler to build it and load it on these APs. I'll report back when I get that going.
>> Jouni Malinen PGP id EFC895FA