Event loop stall bug in hostapd-0.7.3

Bryan Phillippe Bryan.Phillippe at watchguard.com
Thu Jan 5 19:08:47 EST 2012


Hi All,

For the last several weeks, I've been troubleshooting a problem with hostapd-0.7.3 where it gets stuck somewhere in the event loop and clients are no longer able to connect to the AP until I kill and restart hostapd.  I'm not sure what the repro is yet; this seems to happen between 3 to 10 days of uptime on the AP, and only for the busiest APs (I'm running the same version of hostapd on my home network and have never seen the problem).

When the event loop breaks, I will sometimes see a never ending sequence of these log messages:

Jan  5 14:44:01 2012 my-ap local3.debug hostapd: ath1: STA 7c:61:93:3c:83:8c WPA: sending 1/4 msg of 4-Way Handshake
Jan  5 14:44:01 2012 my-ap local3.debug hostapd: ath1: STA 00:23:14:a4:36:30 WPA: EAPOL-Key timeout
Jan  5 14:44:01 2012 my-ap local3.debug hostapd: ath1: STA 00:23:14:a4:36:30 WPA: sending 1/4 msg of 4-Way Handshake
Jan  5 14:44:01 2012 my-ap local3.debug hostapd: ath1: STA 00:23:14:a4:36:30 WPA: EAPOL-Key timeout
Jan  5 14:44:01 2012 my-ap local3.debug hostapd: ath1: STA 00:23:14:a4:36:30 WPA: sending 1/4 msg of 4-Way Handshake
Jan  5 14:44:01 2012 my-ap local3.debug hostapd: ath1: STA 7c:61:93:3c:83:8c WPA: EAPOL-Key timeout
Jan  5 14:44:01 2012 my-ap local3.debug hostapd: ath1: STA 7c:61:93:3c:83:8c WPA: sending 1/4 msg of 4-Way Handshake
Jan  5 14:44:02 2012 my-ap local3.debug hostapd: ath1: STA 00:23:14:a4:36:30 WPA: EAPOL-Key timeout

Also, I have seen this once before (when running in the foreground):

1323384829.239922: ath1: STA 58:94:6b:c1:8b:a8 WPA: sending 1/4 msg of 4-Way Handshake
1323384829.240242: WPA: Send EAPOL(version=2 secure=0 mic=0 ack=1 install=0 pairwise=8 kde_len=0 keyidx=0 encr=0)
1323384829.240357: nl80211: Event message available
1323384829.240421: nl80211: Ignored unknown event (cmd=19)
1323384829.340456: ath1: STA 58:94:6b:c1:8b:a8 WPA: EAPOL-Key timeout
1323384829.340798: WPA: 58:94:6b:c1:8b:a8 WPA_PTK entering state PTKSTART
1323384829.340851: ath1: STA 58:94:6b:c1:8b:a8 WPA: sending 1/4 msg of 4-Way Handshake
1323384829.341074: WPA: Send EAPOL(version=2 secure=0 mic=0 ack=1 install=0 pairwise=8 kde_len=0 keyidx=0 encr=0)


1323385061.356020: eloop: could not process SIGINT or SIGTERM in two seconds. Looks like there
is a bug that ends up in a busy loop that prevents clean shutdown.
Killing program forcefully.


I put some debugging instrumentation into the hostapd code to help me figure out what was happening.  Essentially, this code just increments an iteration counter and tracks a few other bits that are all then dumped to a file via a signal handler for USR2.  So when the hostapd appears to be screwed up, I kill it with USR2 and look at the contents of the file.  If everything is working normally, the event loop iteration counter has incremented since the last time.  Otherwise, the event loop iteration counter is frozen in time, and I start looking at the other bits to see if I can glean some idea of what's gone wrong.  Unfortunately, I don't have a working debugger setup yet, or I'd have done that straight-away.  Anyway, using this methodology, here is what I've found:

Event loop iteration counter stalls.

During one stall, the last known location in the event loop was here (src/utils/eloop.c):

                if (timeout) {
                        os_get_time(&now);
                        if (!os_time_before(&now, &timeout->time)) {
                                void *eloop_data = timeout->eloop_data;
                                void *user_data = timeout->user_data;
                                /* Here is the last place I know it was */
                                eloop_timeout_handler handler =
                                        timeout->handler;
                                eloop_remove_timeout(timeout);
                                handler(eloop_data, user_data);
                        }

                }
                /* Never reached here nor made it back to top of event loop */

And I'm getting the perpetual EAPOL timeout messages in the log.

In another stall, I found that it was stuck on a blocking system call inside handle_monitor_read() (src/drivers/driver_nl80211.c), called from here (src/utils/eloop.c):

                if (res <= 0)
                        continue;

                eloop_sock_table_dispatch(&eloop.readers, rfds); /* <---- stuck in this function */
                /* Never reached here nor made it back to top of event loop */
                eloop_sock_table_dispatch(&eloop.writers, wfds);
                eloop_sock_table_dispatch(&eloop.exceptions, efds);
        }


I haven't been able to track it down any farther in either case.

Here is a snippet of my config (running on top of the latest stable compat-wireless for ath9k on Linux-2.6.38):

interface=ath1
driver=nl80211
bridge=eth2
wpa=2
eapol_version=2
wpa_passphrase=XXXX
wpa_key_mgmt=WPA-PSK
wpa_pairwise=CCMP
beacon_int=200
wpa_group_rekey=86400
wpa_gmk_rekey=86400
wpa_strict_rekey=0


Please let me know if you have any ideas for me to try.

--
-bp



More information about the HostAP mailing list