wpa_supplicant: handle EAPOL initiation (data path) delivered before "associated" message (nl80211) [172225523]

Bug

adexe s nau

Status Update

No update yet.

Description

br...@google.com

created issue #1

Mar 3, 2020 12:48AM

Migrated from

https://crbug.com/1057863

Forked off this (and other comments from that bugs):

https://bugs.chromium.org/p/chromium/issues/detail?id=1044396#c18

"The "pending EAPOL" handling only lasts for 100ms -- after that, wpa_supplicant will just discard the EAPOL. The customer logs show a delay of approx 160ms. That would be enough for us to (silently, unfortunately) drop the pending packet, apparently."

---

The TL;DR: it's possible for wpa_supplicant to receive the initiation for a 4-way handshake (EAPOL, a data packet) before it processes the nl80211 message to say that the network is connected (associated). This causes it to dump this message, and postpone the handshake:

DEBUG wpa_supplicant[721]: wlan0: Not associated - Delay processing of received EAPOL frame (state=ASSOCIATING bssid=00:00:00:00:00:00)

Later, there's an arbitrary time threshold -- if we don't get around to handling this handshake within XXX milliseconds, we totally drop it. This happens to cause interop problems with certain APs (e.g., smartphone hotspots) which don't retry the EAPOL initiation.

Patches like this improve the situation:

https://chromium-review.googlesource.com/c/chromiumos/third_party/hostap/+/2036221

but they're not foolproof. There are definitely still occasions where we see more than a 200ms skew.

---

Filing this bug to track coming up with a better (more reliable) solution than a time-based drop mechanism. The original issue is resolved, but it'd be good to see if we can do better.

Comments

kb...@google.com <kb...@google.com> #2Nov 16, 2020 10:47PM

It seems you should use com.intellij.psi.PsiClass.getSuperClassType() to check

br...@google.com <br...@google.com> #3Jul 20, 2021 06:05PM

Status: New

Analyzing dynamic features on their own like that doesn't work properly since even though they "depend" on the app they're actually including into an app so for example the unused resource check only makes sense when you look at it from the app perspective, not the other way around.

I believe in 7.0 we've fixed this in the sense that there isn't a lint task on the individual dynamic features.

ma...@google.com <ma...@google.com> #4Aug 4, 2021 06:41PM

Assigned to ma...@google.com.

+1. In AGP 7.0 there are no lint tasks in dynamic-features; instead, any dynamic-features are analyzed when running lint on the app module. OP, can you try with AGP 7.0.0-rc01, run lintRelease on the corresponding application module, and see if you still hit the same issue?

ma...@google.com <ma...@google.com> #5Aug 4, 2021 09:10PM

Reassigned to bi...@google.com.

Not using dynamic feature modules but have this with an application module and a library module, happening on 7.0.0-rc01. I haven't figured out a repro yet but will upload one if I manage.

bi...@google.com <bi...@google.com> #6Aug 11, 2021 05:53PM

Re #5, yes, a repro project would be very helpful, thanks!

bi...@google.com <bi...@google.com> #7Aug 25, 2021 07:08PM

Closing this bug as not reproducible, but please reopen if you have a repro project.

bi...@google.com <bi...@google.com> Sep 22, 2021 06:50PM

Reassigned to bu...@google.com.

bu...@google.com <bu...@google.com> #8Sep 22, 2021 06:50PM

Accepted by bu...@google.com.

Hi. I've received your bug and will wait for

b/190650266

to be resolved and then assign the bug to billyzhao@google.com.

Bugjuggler:

b/190650266

-> assigned to billyzhao@google.com

bu...@google.com <bu...@google.com> Oct 12, 2021 12:16AM

Assigned to bi...@google.com.

no...@google.com <no...@google.com> #9Nov 19, 2021 10:34PM

comment #3 mentioned upstream commits that could potentially help. We've recently uprev'd supplicant (bug 193926134) so those fixes should be included now. What are the next steps? See if the bug is still happening even after the uprev?

ap...@google.com <ap...@google.com> #10Dec 22, 2021 09:28PM

Project: chromium/src
Branch: main

commit 6c859a7081de5376dc256387b21e61e7f2b18bc2
Author: Billy Zhao <billyzhao@chromium.org>
Date: Wed Dec 22 21:25:39 2021

Add TimeFromRekeyToFailureSeconds histogram

Implemented in CL:3224374

Bug: b:186763776,b:172225523,b:193155280
Change-Id: Iaa9387c01d5cf802a3d579a41a435fcaf2dc0622
Reviewed-on:

https://chromium-review.googlesource.com/c/chromium/src/+/3293964
Reviewed-by: Weilun Shi <sweilun@chromium.org>
Commit-Queue: Billy Zhao <billyzhao@chromium.org>
Cr-Commit-Position: refs/heads/main@{#953641}

M tools/metrics/histograms/metadata/network/histograms.xml

https://chromium-review.googlesource.com/3293964

ap...@google.com <ap...@google.com> #11Jan 5, 2022 08:58PM

Project: chromiumos/platform2
Branch: main

commit 07bea62f74515450e0c63c675c8f8fe526708569
Author: Billy Zhao <billyzhao@chromium.org>
Date: Thu Oct 14 22:01:51 2021

shill: Add UMA metric to track network instability after rekey

In b:186763776, we notice that the device's network connection becomes
unstable quickly after a rekey. We do not know the scope of how
prevalent this issue is, so we add a metric to track how often it is
that a device's connection becomes unstable after a rekey attempt.

Once a rekey is initiated, if the network becomes unstable within 3
minutes, we record the number of seconds it took for the device to
fail.

BUG=b:186763776,b:172225523,b:193155280
TEST=shill unit tests

Change-Id: Id1617ccb7c98d3072fa6fd5fc3bf2beb7481a1fe
Reviewed-on:

https://chromium-review.googlesource.com/c/chromiumos/platform2/+/3224374
Tested-by: Billy Zhao <billyzhao@chromium.org>
Reviewed-by: Matthew Wang <matthewmwang@chromium.org>
Reviewed-by: Jun Yu <junyuu@chromium.org>
Commit-Queue: Billy Zhao <billyzhao@chromium.org>

M shill/wifi/wifi_service.cc
M shill/wifi/wifi.cc
M shill/wifi/wifi_service.h
M shill/metrics_test.cc
M shill/metrics.cc
M shill/metrics.h

https://chromium-review.googlesource.com/3224374

bi...@google.com <bi...@google.com> #12Jan 11, 2022 11:31PM

Status: New

Relevant metrics have landed. I think at this point we should see if the customer is still experiencing the issue.

br...@google.com <br...@google.com> #13Jan 11, 2022 11:44PM

I'll repeat a note from comment #3: "It also requires a kernel upgrade, or backporting, for kernel<4.17."

So, since the Octopus family is still running kernel 4.14, nothing has taken effect there yet.

no...@google.com <no...@google.com> #14Jan 12, 2022 07:51PM

Assigned to bi...@google.com.

<triage> - octopus is scheduled for kernel uprev to 5.10 in M101 per go/uprev-cal - if the backporting effort is too high, it makes sense to wait for the uprev in M101

AI:

billy to look into the effort required for backporting
billy to monitor the metrics to evaluate the severity/impact of the bug. That could require waiting for the stable release (M99)

bi...@google.com <bi...@google.com> #15Jan 19, 2022 07:12PM

I believe this branch merge introduced the required functionality into kernel 4.17 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.17&id=1dd2eefb8efa97b6b495c4f09fa8232727b93db1

At first glance, It seems like backport effort seems tractable, but I lack experience with backporting so this effort would probably be non-trivial for me. If I get some help to help expedite the backport effort, it might make sense for us to backport this.

ma...@google.com <ma...@google.com> #16Jan 19, 2022 07:31PM

Reassigned to bu...@google.com.

<triage> We'll wait until 2 weeks after M99 hits stable for Billy's metric to collect data and reassess the priority then.

Bugjuggler: wait until 2022-3-14 -> billyzhao

bu...@google.com <bu...@google.com> #17Jan 19, 2022 07:31PM

Accepted by bu...@google.com.

Hi. I've received your bug and will wait until 2022-03-14 00:00 -0700 PDT and then assign the bug to billyzhao.

ku...@google.com <ku...@google.com> #18Jan 26, 2022 07:17PM

<Triage notes> Since the next steps are mentioned in comment #16 (wait 2 weeks after M99 hits stable), marking as triaged.

je...@google.com <je...@google.com> #19Jan 26, 2022 07:29PM

re comment #3 and ("nl80211: Control port over nl80211 helpers"), there's been upstream activity about specifically prioritizing control traffic (https://lore.kernel.org/r/20210206115112.567881-1-markus.theil@tu-ilmenau.de). I needed this for an unrelated bug (b/206865368), but it may be this is helpful for other types of control traffic.

bu...@google.com <bu...@google.com> Mar 14, 2022 07:02AM

Assigned to bi...@google.com.

bi...@google.com <bi...@google.com> #20Mar 14, 2022 07:49AM

Status: New

I have taken a look at the metric stats (metric name : TimeFromRekeyToFailureSeconds, internal link not included due to this bug being on the public tracker). We do not see many reproductions of a connection failure shortly after rekey attempt. Marking untriaged and unassigning to discuss in triage.

no...@google.com <no...@google.com> #21Mar 16, 2022 06:35PM

Assigned to bu...@google.com.

<triage> Kernel uprev in M101 should mitigate the issue. Metric mentioned in comment #20 confirms that the priority is correct.

Assigning to bugjuggler, wait until 4 weeks after M101 reaches stable.

Bugjuggler: wait until 2022-05-31

bu...@google.com <bu...@google.com> #22Mar 16, 2022 06:40PM

Accepted by bu...@google.com.

Hi. I've received your bug and will wait until 2022-05-31 00:00 -0700 PDT and then unassign the bug.

bu...@google.com <bu...@google.com> May 31, 2022 07:03AM

Status: New

ma...@google.com <ma...@google.com> #23Jun 1, 2022 07:01PM

Assigned to ju...@google.com.

<triage> Jun to verify whether or not kernel uprev fixed this issue.

ju...@google.com <ju...@google.com> #24Jun 6, 2022 07:04AM

Reassigned to bu...@google.com.

The uprev schedule on octopus has been pushed back to kernel 5.15 on M107 according to go/uprev-cal

By looking at the other boards with the same wifi chipset Intel AC7265:

asuka uprev 4.19
cave uprev 4.19
coral not uprev 4.4

https://screenshot.googleplex.com/A28vzP7SQBSED6m The curves showing normalized rekey failures per device indicate conflicted data. coral is on par with cave, while both of them are worst than asuka.

The factors may impact the rekey failures may be regions, pillars or etc. I still need the octopus UMA data after M107 stable to evaluate the kernel patch.

Bugjuggler: wait until 2022-10-27 -> junyuu@

bu...@google.com <bu...@google.com> #25Jun 6, 2022 07:09AM

Accepted by bu...@google.com.

Hi. I've received your bug and will wait until 2022-10-27 00:00 -0700 PDT and then assign the bug to junyuu.

bu...@google.com <bu...@google.com> Oct 27, 2022 07:03AM

Assigned to ju...@google.com.

ju...@google.com <ju...@google.com> #26Nov 9, 2022 05:53AM

Reassigned to bu...@google.com.

Since current rekey metric isn't reliable because it doesn't consider the issue when roam between different BSSID, which itself should be distinguished from same BSSID rekey issue.

Billy is working in the CL https://chromium-review.googlesource.com/c/chromiumos/platform2/+/3989543

So current metric is premature to verify whether kernel uprev can really fix this rekey issue caused by EAPOL out-of-order. Pushing back the work.

Bugjuggler: -> junyuu

bu...@google.com <bu...@google.com> #27Nov 9, 2022 05:54AM

Accepted by bu...@google.com.

Hi. I've received your bug, you didn't specify what to wait for so I will wait for

b/190650266

to be resolved and wait for

b/250079464

to be resolved and then assign the bug to junyuu.

Bugjuggler:

b/190650266

b/250079464

-> assigned to junyuu

bu...@google.com <bu...@google.com> Jan 30, 2023 05:20PM

Assigned to ju...@google.com.

ju...@google.com <ju...@google.com> #28Feb 3, 2023 10:57PM

Billy's improvement to rekey UMA metrics landed https://chromium-review.googlesource.com/c/chromiumos/platform2/+/3989543. Wait 3 months until enough new UMA data has been collected.

Xamine shows the log signature wlan0: Not associated - Delay processing of received EAPOL frame still exists on uprev-ed devices like grunt and kernel > 4.17 ones like dedede. xamine Link. There is no clear metric to check for the user impact regard the behaviors depicted by this kind log messages.

The UPSTREAM patch partially fix the problems during rekeying https://chromium-review.googlesource.com/c/chromiumos/third_party/hostap/+/4017982. One of metric can be checked is the number of disconnections with reason rekey failure before and after the CL landed. So this is the next step.

ma...@google.com <ma...@google.com> #29Feb 4, 2023 12:23AM

A couple things:

I think we need to refocus this bug - it's not clear how widespread the user impact of this symptom is. Just because we see Delay processing of ..., it doesn't mean that there's user impact. There's a retry mechanism such that as long as the assoc event comes in "soon", we'll still catch the next retransmission of the EAPOL packet. In fact, just spot-checking a couple of the feedback reports, I wasn't able to see any connection issues. It would be better to use FRA to check how often that log line is shortly succeeded by a handshake timeout (possibly look at bugs that Brian worked on in the past linked above where he was able to repro the issue to extract the correct signature).
This bug has nothing to do with rekeying in particular. In fact, I'm struggling to see how this would actually impact rekeying given that this bug is about EAPOL dropping before "associated" message, which we don't need during a rekey. Again, FRA could help us determine whether this actually ever happens during a rekey.

Jun, can you use FRA to run a couple queries so we can correctly set the priority here and potentially just close this bug out if we can't tie to any discernible user impact?

ju...@google.com <ju...@google.com> #30Mar 9, 2023 01:27AM

Investigation of logs "Delay processing ..."

I queried the Feedback reports with following SQLs in Xamine

system_logs.name = 'syslog' AND REGEXP_CONTAINS(log_lines.content, "Delay processing of received EAPOL frame") AND "network_connection_failure" IN UNNEST(analyzer_tags)

And the same log line with network_disconnect in FRQ tags.

Found zero matched report. The Delay processing ... logs almost have no correlation with WiFi disconnections.

Revisited Brian's bug and repro.

After reading through the original bug, according to https://bugs.chromium.org/p/chromium/issues/detail?id=1044396#c26, the correct signature worths persuing should be Process pending EAPOL if the latency of EAP message is longer than 200ms.

I ran some queried with Xamine and FRA

system_logs.name = 'netlog' AND REGEXP_CONTAINS(log_lines.content, "Process pending EAPOL") AND "network_connection_failure" IN UNNEST(analyzer_tags)

I did find some usable feedback reports that depicting the logs "Process pending EAPOL " and close WiFi disconnections. For examples:

https://listnr.corp.google.com/product/208/report/90749777747

2023-02-12T23:50:29.762176Z DEBUG wpa_supplicant[680]: wlan0: Process pending EAPOL frame that was received just before association notification
2023-02-12T23:50:29.762190Z DEBUG wpa_supplicant[680]: wlan0: RX EAPOL from [MAC OUI=a8:9a:93 IFACE=27]
...
2023-02-12T23:50:38.800219Z DEBUG wpa_supplicant[680]: WPA: Derived Key MIC - hexdump(len=16): 7d 15 09 eb b3 85 6e e0 24 d2 ff 11 4f c1 a5 27
2023-02-12T23:50:39.762401Z NOTICE wpa_supplicant[680]: wlan0: Authentication with [MAC OUI=a8:9a:93 IFACE=27] timed out.

And https://listnr.corp.google.com/product/208/report/90750242542

2023-02-09T09:42:20.906586Z DEBUG wpa_supplicant[692]: wlan0: Process pending EAPOL frame that was received just before association notification
2023-02-09T09:42:20.906600Z DEBUG wpa_supplicant[692]: wlan0: RX EAPOL from [MAC OUI=80:2a:a8 IFACE=25]
...
2023-02-09T09:42:24.922652Z DEBUG wpa_supplicant[692]: nl80211: Delete station [MAC OUI=80:2a:a8 IFACE=25]
2023-02-09T09:42:24.929000Z DEBUG wpa_supplicant[692]: nl80211: Drv Event 39 (NL80211_CMD_DEAUTHENTICATE) received for wlan0
2023-02-09T09:42:24.929025Z DEBUG wpa_supplicant[692]: nl80211: Deauthenticate event
2023-02-09T09:42:24.929050Z DEBUG wpa_supplicant[692]: wlan0: Event DEAUTH (11) received
2023-02-09T09:42:24.929080Z DEBUG wpa_supplicant[692]: wlan0: Deauthentication notification
2023-02-09T09:42:24.929119Z DEBUG wpa_supplicant[692]: wlan0:  * reason 2 (PREV_AUTH_NOT_VALID)

Reconsidering the priority

Need using more of Xamine to determin the priority.

ma...@google.com <ma...@google.com> #31Mar 9, 2023 01:31AM

Nice, thanks Jun!

ju...@google.com <ju...@google.com> Mar 13, 2023 05:07PM

Accepted by ju...@google.com.

ma...@google.com <ma...@google.com> #32Apr 7, 2023 03:59AM

Removing triage because it seems like we're waiting on priority assessment

ku...@google.com <ku...@google.com> #33Apr 12, 2023 06:39PM

<Triage notes> Jun to investigate the data to evaluate the priority.

ju...@google.com <ju...@google.com> #34Apr 20, 2023 05:34PM

Re-run the unioned query mentioned in comment#30 in Xamine Advanced Search

system_logs.name = 'netlog' AND (REGEXP_CONTAINS(log_lines.content, "Process pending EAPOL") OR REGEXP_CONTAINS(log_lines.content, "Delay processing of received EAPOL frame")) AND "network_connection_failure" IN UNNEST(analyzer_tags)

And found out the latest feedback reports matching above query dated on 2023-02-13, which is 2 month early. While I can search the same error messages in feedback reports without connection_failure tag filed today in Xamine. So I think the sympotm I tried to address in comment#30 went away at some point.

By looking at the code, the message "Delay processing of received EAPOL frame" and "Process pending EAPOL" are two phases of one control flow, with the former creating queue for the latter to consume. In case of "Process pending EAPOL", the EAPOL is deliver before ASSOC however it's still within grace period (200ms) before ASSOC, which is accepted by wpa_supplicant. For EAPOL delivered more than 200ms before the ASSOC, wpa_supplicant doesn't consider it's valid for EAPOL flow. Generally speaking, "Process pending EAPOL" isn't an indicator of connection failure. In the opposite, it's a signal that wpa_supplicant is processing it in the expected path.

Regarding the FR filed 2 month ago with message "Process pending EAPOL" and a network_connection_failure FR tag, for example this one, the message was actually seen after the connection_failure. The root cause of connection failure is unclear to me from what I can see from the logs.

2023-02-10 13:34:11.666 5 wpa_supplicant[843]: BSSID [MAC OUI=10:06:ed IFACE=13] ignore list count incremented to 2, ignoring for 10 seconds
2023-02-10 13:34:11.667 6 shill[1045]: INFO shill: [wifi.cc(1162)] WiFi wlan0 supplicant updated DisconnectReason to 15 (4-Way Handshake timeout)
...
2023-02-10 13:34:12.424 7 wpa_supplicant[843]: wlan0: Process pending EAPOL frame that was received just before association notification
2023-02-10 13:34:12.424 7 wpa_supplicant[843]: wlan0: RX EAPOL from [MAC OUI=10:06:ed IFACE=2]

In conclusion: as of today, I found no evidence that "Process pending EAPOL" is related to the likelihood of connection failure. And since FR's don't see such message coincides with connection failure, my suggestion is to close this bug as obsolete.

PS: If FRA and Xamine can compare timestamp of logs with starting/ending time of FRA events, it will be helpful to do deeper analysis.

ma...@google.com <ma...@google.com> #35Apr 20, 2023 07:32PM

Can you instead directly right some FRA queries instead of using analyzer tags? I'm not convinced analyzer tags are giving you the best signal. The second report you linked in comment #30 looks like good signal, would you be able to grab similar reports?

ju...@google.com <ju...@google.com> Jun 6, 2023 01:27AM

Assigned to ju...@google.com.

ma...@google.com <ma...@google.com> #36Sep 8, 2023 04:13PM

Status: New

This bug is 3 years old and it's unclear if there's meaningful impact. Unassigning and backlogging.

Issue 172225523

Description

Issue summary

Comments

kb...@google.com <kb...@google.com> #2Nov 16, 2020 10:47PM

br...@google.com <br...@google.com> #3Jul 20, 2021 06:05PM

ma...@google.com <ma...@google.com> #4Aug 4, 2021 06:41PM

ma...@google.com <ma...@google.com> #5Aug 4, 2021 09:10PM

bi...@google.com <bi...@google.com> #6Aug 11, 2021 05:53PM

bi...@google.com <bi...@google.com> #7Aug 25, 2021 07:08PM

bi...@google.com <bi...@google.com> Sep 22, 2021 06:50PM

bu...@google.com <bu...@google.com> #8Sep 22, 2021 06:50PM

bu...@google.com <bu...@google.com> Oct 12, 2021 12:16AM

no...@google.com <no...@google.com> #9Nov 19, 2021 10:34PM

ap...@google.com <ap...@google.com> #10Dec 22, 2021 09:28PM

ap...@google.com <ap...@google.com> #11Jan 5, 2022 08:58PM

bi...@google.com <bi...@google.com> #12Jan 11, 2022 11:31PM

br...@google.com <br...@google.com> #13Jan 11, 2022 11:44PM

no...@google.com <no...@google.com> #14Jan 12, 2022 07:51PM

bi...@google.com <bi...@google.com> #15Jan 19, 2022 07:12PM

ma...@google.com <ma...@google.com> #16Jan 19, 2022 07:31PM

bu...@google.com <bu...@google.com> #17Jan 19, 2022 07:31PM

ku...@google.com <ku...@google.com> #18Jan 26, 2022 07:17PM

je...@google.com <je...@google.com> #19Jan 26, 2022 07:29PM

bu...@google.com <bu...@google.com> Mar 14, 2022 07:02AM

bi...@google.com <bi...@google.com> #20Mar 14, 2022 07:49AM

no...@google.com <no...@google.com> #21Mar 16, 2022 06:35PM

bu...@google.com <bu...@google.com> #22Mar 16, 2022 06:40PM

bu...@google.com <bu...@google.com> May 31, 2022 07:03AM

ma...@google.com <ma...@google.com> #23Jun 1, 2022 07:01PM

ju...@google.com <ju...@google.com> #24Jun 6, 2022 07:04AM

bu...@google.com <bu...@google.com> #25Jun 6, 2022 07:09AM

bu...@google.com <bu...@google.com> Oct 27, 2022 07:03AM

ju...@google.com <ju...@google.com> #26Nov 9, 2022 05:53AM

bu...@google.com <bu...@google.com> #27Nov 9, 2022 05:54AM

bu...@google.com <bu...@google.com> Jan 30, 2023 05:20PM

ju...@google.com <ju...@google.com> #28Feb 3, 2023 10:57PM

ma...@google.com <ma...@google.com> #29Feb 4, 2023 12:23AM

ju...@google.com <ju...@google.com> #30Mar 9, 2023 01:27AM

Investigation of logs "Delay processing ..."

Revisited Brian's bug and repro.

Reconsidering the priority

ma...@google.com <ma...@google.com> #31Mar 9, 2023 01:31AM

ju...@google.com <ju...@google.com> Mar 13, 2023 05:07PM

ma...@google.com <ma...@google.com> #32Apr 7, 2023 03:59AM

ku...@google.com <ku...@google.com> #33Apr 12, 2023 06:39PM

ju...@google.com <ju...@google.com> #34Apr 20, 2023 05:34PM

ma...@google.com <ma...@google.com> #35Apr 20, 2023 07:32PM

ju...@google.com <ju...@google.com> Jun 6, 2023 01:27AM

ma...@google.com <ma...@google.com> #36Sep 8, 2023 04:13PM

Add comment

Issue metadata