[PATCH] Keep a shadow copy of HFA384X_INTEN_OFF

Tue May 27 23:30:01 EDT 2003

On Tue, May 20, 2003 at 09:59:38AM +0300, Denis Vlasenko wrote:

> Hmm... this is a small bit of a bigger problem. If we are to guard
> against interrupt / schedule / other CPU hitting us between these
> two lines of code:
> 
>         HFA384X_OUTW(HFA384X_EVENT_MASK, HFA384X_INTEN_OFF);
>         ((local_info_t*)dev->priv)->shadow_hfa384x_inten = HFA384X_EVENT_MASK;
> 
> we are almost inevitably want to protect some surrounding code too,
> because setting IntEn is usually only one stage of some larger
> hw manipulation.

I have to admit that I'm not 100% sure about current code, but this is
partially (yeah, great ;-) protected by processing all the BAP events
from one tasklet. The problem here is that hw reset is a major exception
on this and having the shadow copy of the IntEn enabled race that ended
up in hanging the hardware in infinite interrupt loop. Removing the
shadow copy was a quick way of getting rid of that hang case.

One problem with Prism2 in general is the long duration of some commands
and I would prefer to use CmdCompl events instead of busy waiting for
the completion. Especially, I do not want to disable interrupts for
those waits. Semaphores would be fine for operations that are done only
in process context. I think that I have removed need for doing set/get
RID in interrupt context, but at least BAP access still needs protection
in (sw) interrupt handlers.

> Random example:
> 
> hostap_ioctl -> prism2_ioctl_siwrate -> hostap_set_rate,
> no locks/semaphores held, we arrive here:
> 
> ret = (hostap_set_word(dev, HFA384X_RID_TXRATECONTROL,
>                        local->tx_rate_control) ||
>        hostap_set_word(dev, HFA384X_RID_CNFSUPPORTEDRATES,
>                        local->tx_rate_control) ||
>        local->func->reset_port(dev));   <--- (see below)
> 
> What protects us between two set_port() calls or second set_port()
> and reset_port()?

Nothing and in this case I do not think additional protection would be
that useful. Set word is setting RID values and reset_port is asking
hardware to take new settings into use. Sure you could have two
processing running those in order that would end up having TXRATECONTROL
and CNFSUPPORTEDRATES set to different values. I do not see this as a
practical issue, but yes, it is not 100% correct.

> hostap_set_word -> set_rid -> hfa384x_set_rid:
> ...
>         res = down_interruptible(&local->rid_bap_sem);
> 	*** works with BAP and cmd ***
>         up(&local->rid_bap_sem);
> ...
>         if (res == -ETIMEDOUT)
>                 prism2_hw_reset(dev);  <--- (see below)
>         return res;
> }
> 
> Again, no protection between up(&local->rid_bap_sem) and
> prism2_hw_reset(dev).

hw_reset is recovery from hardware errors and it should not be used. I
see no need for additional locking in this case.

> static int prism2_reset_port(struct net_device *dev)
> ...
>         res = hfa384x_cmd(dev, HFA384X_CMDCODE_DISABLE, 0,
>                           NULL, NULL);
>         if (res)
>                 printk...
>         else {
>                 res = hfa384x_cmd(dev, HFA384X_CMDCODE_ENABLE, 0,
>                                   NULL, NULL);
> ...
> }
> 
> No protection between two hfa384x_cmd() calls

reset_port is used to take new configuration items into use, it is
enough to disable and enable the port and I think it would be ok for
something else to happen between those calls. Running another
configuration task concurrently, could mean that some changes from it
would be taken into use before the other part of the changes. This would
be "fixed" by the reset_port in the end of that configuration, so I do
not think this would be major issue.

> static int hfa384x_cmd(...)
> {
> ...
>         if (local->cmd_queue_len >= HOSTAP_CMD_QUEUE_MAX_LEN) {
>                 printk(KERN_DEBUG "%s: hfa384x_cmd: cmd_queue full\n",
>                        dev->name);
>                 return -1;
>         }
> 
> No protection against local->cmd_queue_len++ on another CPU.

cmd_queue_len can be a soft limit, so I do not see much harm in that.
cmd_queue_len dec/inc are protected by spin_lock_irqsave, so those
operations do not even need to be atomic.

> static void prism2_hw_reset(struct net_device *dev)
> {
> ...
>         if (local->hw_resetting) {
>                 printk(KERN_WARNING "%s: %s: already resetting card - "
>                        "ignoring reset request\n", dev_info, dev->name);
>                 return;
>         }
> ...
>         local->hw_resetting = 1;
> 
> Race against reset on another CPU.

If I remember correctly, I added hw_resetting only as an mechanism
against busy-looping resets and this would thus not be a major problem.

Even though I wrote some counter arguments for all these examples, I'm
not trying to say that everything is perfect in the current driver.
There used to be more difficulties in some of the locking cases since
more operations were performed in interrupt context. I haven't had
enough time to go through the current code thoroughly, but it might be
possible to add one semaphore for protecting most of the process context
operations.

I do not know when I would have time to go through the locking code in
the current version. From the practical view point, I would first like
to get rid of the reported crashes/hangs on SMP hosts, if there is a
quick way of removing (hiding?) these. After this, I could try to find a
suitable time for messing up with the locking (which easily means that
there will not be new releases for some time).

Please feel free to propose changes for the locking. However, please do
note that these should be obvious enough so that I can understand why
something should be changed. In addition, this kind of changes should
get enough testing so that I can trust them enough to work on most
platforms and all cs/plx/pci versions without moving back to busy
waiting..

-- 
Jouni Malinen                                            PGP id EFC895FA