More information (was Re: Event loop stall bug in hostapd-0.7.3)

Bryan Phillippe Bryan.Phillippe at watchguard.com
Mon Jan 9 17:59:17 EST 2012


Hi All,

Well, I was able to debug this problem more during a repro today.  I found a lot of information.  Basically, we're stuck in wpa_driver_nl80211_send_frame() from src/drivers/driver_nl80211.c here:

	};

	if (encrypt)
		rtap_hdr[8] |= IEEE80211_RADIOTAP_F_WEP;

	return sendmsg(drv->monitor_sock, &msg, 0); /* <--------------- stucky */
}

The socket is blocked, so hostapd sleeps forever.  I can hit it with signals and the signal handler will execute, but I have to kill the program and restart it to restore operation.  Here is the backtrace:

(gdb) where
#0  0x0fd16088 in sendmsg ()
   from /home/bp/P4/rootfs/root/lib/libc.so.6
#1  0x1002b614 in wpa_driver_nl80211_send_frame (drv=<value optimized out>,
    data=<value optimized out>, len=<value optimized out>,
    encrypt=<value optimized out>) at ../src/drivers/driver_nl80211.c:2783
#2  0x1000b6bc in hostapd_send_mgmt_frame (hapd=<value optimized out>,
    msg=<value optimized out>, len=<value optimized out>)
    at ../src/ap/ap_drv_ops.c:64
#3  0x10045bf8 in handle_probe_req (hapd=0x10072c40, mgmt=0x9fffeef6,
    len=<value optimized out>) at ../src/ap/beacon.c:331
#4  0x10049674 in ieee802_11_mgmt (hapd=0x10072c40, buf=0x9fffeef6 "@",
    len=70, fi=<value optimized out>) at ../src/ap/ieee802_11.c:1425
#5  0x1000b5ec in hostapd_mgmt_rx (ctx=<value optimized out>,
    event=<value optimized out>, data=0x9fffee80)
    at ../src/ap/drv_callbacks.c:302
#6  wpa_supplicant_event (ctx=<value optimized out>,
    event=<value optimized out>, data=0x9fffee80)
    at ../src/ap/drv_callbacks.c:422
#7  0x10030ca8 in handle_frame (sock=<value optimized out>,
    eloop_ctx=0x10072700, sock_ctx=<value optimized out>)
    at ../src/drivers/driver_nl80211.c:3141
#8  handle_monitor_read (sock=<value optimized out>, eloop_ctx=0x10072700,
---Type <return> to continue, or q <return> to quit---
    sock_ctx=<value optimized out>) at ../src/drivers/driver_nl80211.c:3229
#9  0x1001b1f4 in eloop_sock_table_dispatch (table=0x10071100, fds=0x100737b8)
    at ../src/utils/eloop.c:222
#10 0x1001b914 in eloop_run () at ../src/utils/eloop.c:588
#11 0x100044ec in hostapd_global_run (argc=<value optimized out>,
    argv=0x9ffffec4) at main.c:439
#12 main (argc=<value optimized out>, argv=0x9ffffec4) at main.c:548
(gdb) 

The sendmsg() is blocked on the monitor_sock, which is apparently blocking IO and unable to send for some reason.
More information:

(gdb) p *(struct wpa_driver_nl80211_data *)0x100727d0
$6 = {ctx = 0x10072700, netlink = 0x0, ioctl_sock = 13,
  brname = "ath1", '\000' <repeats 11 times>, ifindex = 8388608,
  if_removed = 21, capa = {key_mgmt = 16, enc = 13, auth = 18, flags = 16,
    max_scan_ssids = 16, max_remain_on_chan = 16}, has_capability = 0,
  operstate = 0, scan_complete_events = 0, nl_sock = 0x0, nl_sock_event = 0x0,
  nl_cache = 0x0, nl_cache_event = 0x0, nl_cb = 0x0, nl80211 = 0x0,
  auth_bssid = "\000\000\000\000\020\a", bssid = "'\364\000\000\000\020",
  associated = 0,
  ssid = '\000' <repeats 15 times>, "\021\020\a'\000\020\003M\020\020\003(@\000\000\001I", ssid_len = 268904872, nlmode = 268904872, ap_scan_as_station = 1,
  assoc_freq = 13107200, monitor_sock = 2346, monitor_ifidx = 2346,
  probe_req_report = 17432576, disable_11b_rates = 1,
  pending_remain_on_chan = 0, added_bridge = 0, added_if_into_bridge = 0,
  remain_on_chan_cookie = 0, send_action_cookie = 1154435205700780032,
  filter_ssids = 0x0, num_filter_ssids = 0, first_bss = {drv = 0x0,
    next = 0xffffffff, ifindex = 0,
    ifname = '\000' <repeats 12 times>"\377, \377\377\377", beacon_set = 0},
  eapol_sock = 0, default_if_indices = {0, 0, -1, 0, 0, 0, 0, -1, 0, 0, 0, 0,
    -1, 0, 0, 0}, if_indices = 0x0, num_if_indices = -1, last_freq = 0,
  last_freq_ht = 0}
(gdb) p ((struct wpa_driver_nl80211_data *)0x100727d0)->monitor_sock
$7 = 2346
(gdb) 

Seems like this socket should either be non-blocking or have a write timeout.  But chiefly, I guess I need to figure out why this socket is blocking in the first place.

As I said before, this seems to happen only while I have a lot of people hitting the AP.  Case in point, it's in a conference room at the office, and there was a meeting in there today...

Thoughts?

--
-bp


On Jan 5, 2012, at 4:08 PM, Bryan Phillippe wrote:

> Hi All,
> 
> For the last several weeks, I've been troubleshooting a problem with hostapd-0.7.3 where it gets stuck somewhere in the event loop and clients are no longer able to connect to the AP until I kill and restart hostapd.  I'm not sure what the repro is yet; this seems to happen between 3 to 10 days of uptime on the AP, and only for the busiest APs (I'm running the same version of hostapd on my home network and have never seen the problem).
> 
> When the event loop breaks, I will sometimes see a never ending sequence of these log messages:
> 
> Jan  5 14:44:01 2012 my-ap local3.debug hostapd: ath1: STA 7c:61:93:3c:83:8c WPA: sending 1/4 msg of 4-Way Handshake
> Jan  5 14:44:01 2012 my-ap local3.debug hostapd: ath1: STA 00:23:14:a4:36:30 WPA: EAPOL-Key timeout
> Jan  5 14:44:01 2012 my-ap local3.debug hostapd: ath1: STA 00:23:14:a4:36:30 WPA: sending 1/4 msg of 4-Way Handshake
> Jan  5 14:44:01 2012 my-ap local3.debug hostapd: ath1: STA 00:23:14:a4:36:30 WPA: EAPOL-Key timeout
> Jan  5 14:44:01 2012 my-ap local3.debug hostapd: ath1: STA 00:23:14:a4:36:30 WPA: sending 1/4 msg of 4-Way Handshake
> Jan  5 14:44:01 2012 my-ap local3.debug hostapd: ath1: STA 7c:61:93:3c:83:8c WPA: EAPOL-Key timeout
> Jan  5 14:44:01 2012 my-ap local3.debug hostapd: ath1: STA 7c:61:93:3c:83:8c WPA: sending 1/4 msg of 4-Way Handshake
> Jan  5 14:44:02 2012 my-ap local3.debug hostapd: ath1: STA 00:23:14:a4:36:30 WPA: EAPOL-Key timeout
> 
> Also, I have seen this once before (when running in the foreground):
> 
> 1323384829.239922: ath1: STA 58:94:6b:c1:8b:a8 WPA: sending 1/4 msg of 4-Way Handshake
> 1323384829.240242: WPA: Send EAPOL(version=2 secure=0 mic=0 ack=1 install=0 pairwise=8 kde_len=0 keyidx=0 encr=0)
> 1323384829.240357: nl80211: Event message available
> 1323384829.240421: nl80211: Ignored unknown event (cmd=19)
> 1323384829.340456: ath1: STA 58:94:6b:c1:8b:a8 WPA: EAPOL-Key timeout
> 1323384829.340798: WPA: 58:94:6b:c1:8b:a8 WPA_PTK entering state PTKSTART
> 1323384829.340851: ath1: STA 58:94:6b:c1:8b:a8 WPA: sending 1/4 msg of 4-Way Handshake
> 1323384829.341074: WPA: Send EAPOL(version=2 secure=0 mic=0 ack=1 install=0 pairwise=8 kde_len=0 keyidx=0 encr=0)
> 
> 
> 1323385061.356020: eloop: could not process SIGINT or SIGTERM in two seconds. Looks like there
> is a bug that ends up in a busy loop that prevents clean shutdown.
> Killing program forcefully.
> 
> 
> I put some debugging instrumentation into the hostapd code to help me figure out what was happening.  Essentially, this code just increments an iteration counter and tracks a few other bits that are all then dumped to a file via a signal handler for USR2.  So when the hostapd appears to be screwed up, I kill it with USR2 and look at the contents of the file.  If everything is working normally, the event loop iteration counter has incremented since the last time.  Otherwise, the event loop iteration counter is frozen in time, and I start looking at the other bits to see if I can glean some idea of what's gone wrong.  Unfortunately, I don't have a working debugger setup yet, or I'd have done that straight-away.  Anyway, using this methodology, here is what I've found:
> 
> Event loop iteration counter stalls.
> 
> During one stall, the last known location in the event loop was here (src/utils/eloop.c):
> 
>                if (timeout) {
>                        os_get_time(&now);
>                        if (!os_time_before(&now, &timeout->time)) {
>                                void *eloop_data = timeout->eloop_data;
>                                void *user_data = timeout->user_data;
>                                /* Here is the last place I know it was */
>                                eloop_timeout_handler handler =
>                                        timeout->handler;
>                                eloop_remove_timeout(timeout);
>                                handler(eloop_data, user_data);
>                        }
> 
>                }
>                /* Never reached here nor made it back to top of event loop */
> 
> And I'm getting the perpetual EAPOL timeout messages in the log.
> 
> In another stall, I found that it was stuck on a blocking system call inside handle_monitor_read() (src/drivers/driver_nl80211.c), called from here (src/utils/eloop.c):
> 
>                if (res <= 0)
>                        continue;
> 
>                eloop_sock_table_dispatch(&eloop.readers, rfds); /* <---- stuck in this function */
>                /* Never reached here nor made it back to top of event loop */
>                eloop_sock_table_dispatch(&eloop.writers, wfds);
>                eloop_sock_table_dispatch(&eloop.exceptions, efds);
>        }
> 
> 
> I haven't been able to track it down any farther in either case.
> 
> Here is a snippet of my config (running on top of the latest stable compat-wireless for ath9k on Linux-2.6.38):
> 
> interface=ath1
> driver=nl80211
> bridge=eth2
> wpa=2
> eapol_version=2
> wpa_passphrase=XXXX
> wpa_key_mgmt=WPA-PSK
> wpa_pairwise=CCMP
> beacon_int=200
> wpa_group_rekey=86400
> wpa_gmk_rekey=86400
> wpa_strict_rekey=0
> 
> 
> Please let me know if you have any ideas for me to try.
> 
> --
> -bp
> 
> _______________________________________________
> HostAP mailing list
> HostAP at lists.shmoo.com
> http://lists.shmoo.com/mailman/listinfo/hostap



More information about the HostAP mailing list