kernel warning with 0.0.20170223: entered softirq 3 NET_RX net_rx_action+0x0/0x760 with preempt_count 00000101, exited with 00000100?

Jason A. Donenfeld Jason at zx2c4.com
Mon Feb 27 04:22:34 CET 2017


Hey Pipacs,

I've been receiving reports of strange bugs from grsec users with
WireGuard. The first set of bugs was a heisenbug crash, and I never
found the root cause, but it seemed to happen in the rx path. Then
today Timothée emailed a different bug from a grsec box, also
along the rx path. This time it was related to the preemption count
being wrong coming into and going out of the rx softirq. This kind of
preemption mismatch, I figure, might account for the earlier bug I
never solved.

So armed with this new information, I went hunting. I followed the
path inward, surrounding the body of each function with:

int i = preempt_count();
function_body...
if (i != preempt_count()) pr_err("LORDHAVEMERCY\n");
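
As a rough illustration of that bracketing trick, here is a userspace
sketch, not kernel code. Everything in it is a stand-in invented for the
example: fake_count and this preempt_count() model the kernel counter,
CHECK_PREEMPT is a hypothetical helper macro, and balanced_fn/leaky_fn
are made-up functions under test.

```c
#include <stdio.h>

/* Hypothetical userspace stand-in for the kernel's preempt counter. */
static unsigned int fake_count;
static unsigned int preempt_count(void) { return fake_count; }

/* Number of wrapped calls that returned with a different count. */
static int mismatches;

/* Snapshot the count, run the call, and complain if the count changed. */
#define CHECK_PREEMPT(call)                                           \
	do {                                                          \
		unsigned int before_ = preempt_count();               \
		call;                                                 \
		if (before_ != preempt_count()) {                     \
			mismatches++;                                 \
			fprintf(stderr, "%s:%d: count %#x -> %#x\n",  \
				__FILE__, __LINE__,                   \
				before_, preempt_count());            \
		}                                                     \
	} while (0)

/* Made-up functions: one leaves the count alone, one leaks a decrement. */
static void balanced_fn(void) { fake_count++; fake_count--; }
static void leaky_fn(void)    { fake_count--; }

/* Returns how many of the two wrapped calls were flagged. */
static int run_demo(void)
{
	fake_count = 0x101;
	mismatches = 0;
	CHECK_PREEMPT(balanced_fn());
	CHECK_PREEMPT(leaky_fn());
	return mismatches;
}
```

Wrapping each callee this way and descending into whichever one gets
flagged is the same narrowing-down procedure, just packaged as a macro.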

Eventually I isolated the bug to an interesting situation like this:

int i = preempt_count();
other_function(...);
if (i != preempt_count()) pr_err("This will print out\n");

void other_function(int a)
{
    int vla[a];
    int i = preempt_count();
    function_body...
    if (i != preempt_count()) pr_err("This will NOT print out\n");
}

Since I only got the outer print, I thought this was strange, so I rearranged:

void other_function(int a)
{
    int i = preempt_count();
    int vla[a];
    if (i != preempt_count()) pr_err("This will print out\n");
    function_body...
}

Yay, we found the bug. But wtf, what could possibly be changing the
preempt_count there?

So I went disassembling, and lo and behold the clever PaX stack leak
plugin was adding calls to pax_check_alloca. Very nice! But still, why
the preemption bug situation? I went hunting further:

void __used pax_check_alloca(unsigned long size)
{
 ...
       case STACK_TYPE_IRQ:
               stack_left = sp & (IRQ_STACK_SIZE - 1);
               put_cpu();
               break;
 ...
}

Do you see the bug? Looks like somebody snuck in a "put_cpu()" there,
where it really does not belong. "put_cpu()" is just preempt_enable(),
so an unpaired call decrements the preempt_count by one, which matches
the 00000101 -> 00000100 in the warning. I can confirm that removing
the erroneous call to "put_cpu()" fixes the bug.

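To make the imbalance concrete, here is a small userspace model, again
not kernel code. It relies on one fact about the kernel: get_cpu()
disables preemption before returning the CPU id, and put_cpu() is
preempt_enable(). Every model_* name is hypothetical, and the two check
functions only mimic the shape of the STACK_TYPE_IRQ branch, not the
real pax_check_alloca.

```c
/* Userspace model of the preempt counter; all names are hypothetical. */
static unsigned int model_count;

static void model_preempt_disable(void) { model_count++; }
static void model_preempt_enable(void)  { model_count--; }

/* In the kernel, get_cpu() and put_cpu() are meant to be used as a pair. */
static int  model_get_cpu(void) { model_preempt_disable(); return 0; }
static void model_put_cpu(void) { model_preempt_enable(); }

enum model_stack_type { MODEL_STACK_TASK, MODEL_STACK_IRQ };

/* Shape of the buggy branch: put_cpu() with no get_cpu() before it. */
static void model_check_buggy(enum model_stack_type type)
{
	if (type == MODEL_STACK_IRQ)
		model_put_cpu();
}

/* Same branch with the stray call removed. */
static void model_check_fixed(enum model_stack_type type)
{
	(void)type;
}

/* A correctly paired user, for contrast: net change is zero. */
static void model_check_paired(enum model_stack_type type)
{
	(void)type;
	(void)model_get_cpu();
	model_put_cpu();
}

/* Enter a pretend softirq at 0x101, run the check, report the exit count. */
static unsigned int model_softirq(void (*check)(enum model_stack_type))
{
	model_count = 0x101;
	check(MODEL_STACK_IRQ);
	return model_count;
}
```

In this model, only model_check_buggy leaves the softirq with a count
one lower than it entered with; the fixed and paired variants return
with the count unchanged.
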
So, either this is by design, and there's some odd subtlety I'm
missing, or this is a bug that should be fixed in grsec/PaX.

In the case of the latter, I believe this introduces a security
vulnerability, since it opens up a whole host of interesting race
conditions that can be exploited.

Thanks,
Jason

