Workaround for some SMP stability issues - request for testing
jkmaline at cc.hut.fi
Tue Aug 5 23:12:42 EDT 2003
Thanks to all the testing Michael Vallaly did with a couple of patches to
the Host AP driver, I think I now have quite a bit more information about
the stability issues with hostap_pci on SMP systems. It looks like there
are two separate issues. 1) Something is corrupting the information
transferred between the card and the driver, which then results in
frequent wlan hw resets. 2) hw_reset with hostap_pci on an SMP system
seems to hang the system completely in some cases (completely enough to
even prevent the NMI watchdog from detecting the hang).
It looks like the major cause of the frequent resets was the fid
registers (RX, TX, TX Error, AllocFid) getting corrupted. I do not have
any explanation for this apart from a hardware/firmware bug. It looks
like consecutive reads of a fid register return different values even
though the register should not change before the event is acknowledged.
I added code that will try its best to make sure that the fid register
read returns the correct fid number. This workaround reads the register
three times and returns the value only if at least two of the reads
return the same value. This will be repeated up to five times.
The workaround seemed to eliminate more or less all corrupted fid values
in the test system. Consequently, there was no need to reset the
hardware and no host system hangs.
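
The read-three-times voting described above could be sketched roughly as
follows. This is only an illustration and not the code that went into
CVS: read_fid_voted(), fid_read_fn, struct fake_reg and fake_read() are
hypothetical names, the register access is replaced with a callback so
the logic can be tried in user space, and the retry limit of five is
taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed outer retry limit, taken from the description in the text. */
#define FID_READ_ATTEMPTS 5

/* Hypothetical accessor; a real driver would read the hardware register. */
typedef uint16_t (*fid_read_fn)(void *ctx);

/*
 * Read the register three times per attempt and accept a value only if
 * at least two of the three reads agree. Returns 0 and stores the value
 * in *fid on success, or -1 if no attempt produced an agreeing pair.
 */
static int read_fid_voted(fid_read_fn read_reg, void *ctx, uint16_t *fid)
{
    assert(read_reg != NULL && fid != NULL);

    for (int attempt = 0; attempt < FID_READ_ATTEMPTS; attempt++) {
        uint16_t a = read_reg(ctx);
        uint16_t b = read_reg(ctx);
        uint16_t c = read_reg(ctx);

        if (a == b || a == c) {
            *fid = a;
            return 0;
        }
        if (b == c) {
            *fid = b;
            return 0;
        }
    }
    return -1;
}

/* Fake register for experimenting: replays a fixed sequence of values. */
struct fake_reg {
    const uint16_t *values;
    size_t count;
    size_t pos; /* number of reads performed so far */
};

static uint16_t fake_read(void *ctx)
{
    struct fake_reg *reg = ctx;
    uint16_t value = reg->values[reg->pos % reg->count];

    reg->pos++;
    return value;
}
```

With a fake register that returns one corrupted value followed by two
matching ones, the function returns the matching value. If the three
reads never agree, it gives up after five rounds and the caller has to
decide what to do next (for example, fall back to a reset).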
I added the workaround and test code into CVS and I would like to ask
people with SMP systems to test the CVS snapshot version and report what
they see in the kernel log ('dmesg' output) and whether they see any
system hangs or, in general, any changes compared to previous versions
of the driver.
In addition to the fid read workaround, I also changed TX fid array
handling to use heavier locking. The previously used locking may have
been insufficient on SMP systems. This change alone was also able to
reduce the number of hardware resets, but this may have been due to
changed timing (added latency due to heavier locking). Anyway, fid reads
were still producing corrupted results.
So far, I have mostly concentrated on finding out what is causing the
resets. I will next try to figure out what could be done about the
hw_reset and its relation to the complete host system lockup.
Jouni Malinen PGP id EFC895FA