Status Update
Comments
kb...@google.com <kb...@google.com> #2
br...@google.com <br...@google.com> #3
Analyzing dynamic features on their own like that doesn't work properly: even though they "depend" on the app, they're actually included into the app. For example, the unused-resource check only makes sense when you look at it from the app's perspective, not the other way around.
I believe in 7.0 we've fixed this in the sense that there isn't a lint task on the individual dynamic features.
ma...@google.com <ma...@google.com> #4
+1. In AGP 7.0 there are no lint tasks in dynamic features; instead, any dynamic features are analyzed when running lint on the app module. OP, can you try with AGP 7.0.0-rc01, run lintRelease on the corresponding application module, and see if you still hit the same issue?
ma...@google.com <ma...@google.com> #5
I'm not using dynamic feature modules, but I see this with an application module and a library module on 7.0.0-rc01. I haven't figured out a repro yet but will upload one if I manage.
bi...@google.com <bi...@google.com> #6
Re #5, yes, a repro project would be very helpful, thanks!
bi...@google.com <bi...@google.com> #7
Closing this bug as not reproducible, but please reopen if you have a repro project.
bu...@google.com <bu...@google.com> #8
Bugjuggler:
no...@google.com <no...@google.com> #9
ap...@google.com <ap...@google.com> #10
Branch: main
commit 6c859a7081de5376dc256387b21e61e7f2b18bc2
Author: Billy Zhao <billyzhao@chromium.org>
Date: Wed Dec 22 21:25:39 2021
Add TimeFromRekeyToFailureSeconds histogram
Implemented in CL:3224374
Bug: b:186763776,b:172225523,b:193155280
Change-Id: Iaa9387c01d5cf802a3d579a41a435fcaf2dc0622
Reviewed-on:
Reviewed-by: Weilun Shi <sweilun@chromium.org>
Commit-Queue: Billy Zhao <billyzhao@chromium.org>
Cr-Commit-Position: refs/heads/main@{#953641}
M tools/metrics/histograms/metadata/network/histograms.xml
ap...@google.com <ap...@google.com> #11
Branch: main
commit 07bea62f74515450e0c63c675c8f8fe526708569
Author: Billy Zhao <billyzhao@chromium.org>
Date: Thu Oct 14 22:01:51 2021
shill: Add UMA metric to track network instability after rekey
In b:186763776, we notice that the device's network connection becomes
unstable quickly after a rekey. We do not know the scope of how
prevalent this issue is, so we add a metric to track how often a
device's connection becomes unstable after a rekey attempt.
Once a rekey is initiated, if the network becomes unstable within 3
minutes, we record the number of seconds it took for the device to
fail.
BUG=b:186763776,b:172225523,b:193155280
TEST=shill unit tests
Change-Id: Id1617ccb7c98d3072fa6fd5fc3bf2beb7481a1fe
Reviewed-on:
Tested-by: Billy Zhao <billyzhao@chromium.org>
Reviewed-by: Matthew Wang <matthewmwang@chromium.org>
Reviewed-by: Jun Yu <junyuu@chromium.org>
Commit-Queue: Billy Zhao <billyzhao@chromium.org>
M shill/wifi/wifi_service.cc
M shill/wifi/wifi.cc
M shill/wifi/wifi_service.h
M shill/metrics_test.cc
M shill/metrics.cc
M shill/metrics.h
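The metric logic described in the commit message above can be sketched roughly as follows. This is a minimal illustrative sketch in Python, not the actual shill C++ implementation; the class and method names are hypothetical, and only the 3-minute window comes from the commit message.

```python
import time

# 3 minutes, per the commit message: only failures within this window
# after a rekey attempt are recorded.
REKEY_TO_FAILURE_WINDOW_SECONDS = 180

class RekeyFailureTracker:
    """Tracks whether a connection failure follows shortly after a rekey."""

    def __init__(self, now=time.monotonic):
        self._now = now
        self._rekey_start = None  # monotonic timestamp of the last rekey attempt

    def on_rekey_start(self):
        self._rekey_start = self._now()

    def on_connection_failure(self):
        """Return seconds from rekey to failure if within the window, else None."""
        if self._rekey_start is None:
            return None
        elapsed = self._now() - self._rekey_start
        self._rekey_start = None  # report at most one failure per rekey
        if elapsed <= REKEY_TO_FAILURE_WINDOW_SECONDS:
            return int(elapsed)  # value that would feed the UMA histogram
        return None
```

A failure outside the window (or without a preceding rekey) records nothing, so the histogram only counts the "unstable shortly after rekey" case the bug is about.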
bi...@google.com <bi...@google.com> #12
Relevant metrics have landed. I think at this point we should see if the customer is still experiencing the issue.
br...@google.com <br...@google.com> #13
I'll repeat a note from
So, since the Octopus family is still running kernel 4.14, nothing has taken effect there yet.
no...@google.com <no...@google.com> #14
AI:
- billy to look into the effort required for backporting
- billy to monitor the metrics to evaluate the severity/impact of the bug. That could require waiting for the stable release (M99)
bi...@google.com <bi...@google.com> #15
I believe this branch merge introduced the required functionality into kernel 4.17
At first glance, the backport effort seems tractable, but I lack experience with backporting, so it would probably be non-trivial for me. With some help to expedite it, backporting might make sense for us.
ma...@google.com <ma...@google.com> #16
<triage> We'll wait until 2 weeks after M99 hits stable for Billy's metric to collect data and reassess the priority then.
Bugjuggler: wait until 2022-3-14 -> billyzhao
bu...@google.com <bu...@google.com> #17
ku...@google.com <ku...@google.com> #18
<Triage notes> Since the next steps are mentioned in
je...@google.com <je...@google.com> #19
re ("nl80211: Control port over nl80211 helpers")
, there's been upstream activity about specifically prioritizing control traffic (
bi...@google.com <bi...@google.com> #20
I have taken a look at the metric stats (metric name : TimeFromRekeyToFailureSeconds, internal link not included due to this bug being on the public tracker). We do not see many reproductions of a connection failure shortly after rekey attempt. Marking untriaged and unassigning to discuss in triage.
no...@google.com <no...@google.com> #21
<triage> Kernel uprev in M101 should mitigate the issue. Metric mentioned in
Assigning to bugjuggler, wait until 4 weeks after M101 reaches stable.
Bugjuggler: wait until 2022-05-31
bu...@google.com <bu...@google.com> #22
ma...@google.com <ma...@google.com> #23
<triage> Jun to verify whether or not kernel uprev fixed this issue.
ju...@google.com <ju...@google.com> #24
The uprev schedule for octopus has been pushed back to kernel 5.15 in M107, according to go/uprev-cal.
Looking at the other boards with the same WiFi chipset (Intel AC7265):
- asuka: uprev to 4.19
- cave: uprev to 4.19
- coral: not uprev'd, still on 4.4
The factors that may impact the rekey failures could be regions, pillars, etc. I still need the octopus UMA data after M107 stable to evaluate the kernel patch.
Bugjuggler: wait until 2022-10-27 -> junyuu@
bu...@google.com <bu...@google.com> #25
ju...@google.com <ju...@google.com> #26
The current rekey metric isn't reliable because it doesn't account for roams between different BSSIDs, which should be distinguished from the same-BSSID rekey issue.
Billy is working on the CL
So the current metric is premature for verifying whether the kernel uprev can really fix this rekey issue caused by out-of-order EAPOL. Pushing back the work.
Bugjuggler: -> junyuu
bu...@google.com <bu...@google.com> #27
Bugjuggler:
ju...@google.com <ju...@google.com> #28
Billy's improvement to the rekey UMA metrics landed.
Xamine shows the log signature wlan0: Not associated - Delay processing of received EAPOL frame
still exists on uprev'd devices like grunt and on kernel > 4.17 ones like dedede.
The UPSTREAM patch only partially fixes the problems during rekeying.
ma...@google.com <ma...@google.com> #29
A couple of things:
- I think we need to refocus this bug - it's not clear how widespread the user impact of this symptom is. Just because we see "Delay processing of ...", it doesn't mean that there's user impact. There's a retry mechanism such that, as long as the assoc event comes in "soon", we'll still catch the next retransmission of the EAPOL packet. In fact, just spot-checking a couple of the feedback reports, I wasn't able to see any connection issues. It would be better to use FRA to check how often that log line is shortly followed by a handshake timeout (possibly look at the bugs that Brian worked on in the past, linked above, where he was able to repro the issue, to extract the correct signature).
- This bug has nothing to do with rekeying in particular. In fact, I'm struggling to see how this would actually impact rekeying, given that this bug is about EAPOL being dropped before the "associated" message, which we don't need during a rekey. Again, FRA could help us determine whether this actually ever happens during a rekey.
Jun, can you use FRA to run a couple of queries so we can correctly set the priority here, and potentially just close this bug out if we can't tie it to any discernible user impact?
ju...@google.com <ju...@google.com> #30
Investigation of logs "Delay processing ..."
I queried the Feedback reports with the following SQL in
system_logs.name = 'syslog' AND REGEXP_CONTAINS(log_lines.content, "Delay processing of received EAPOL frame") AND "network_connection_failure" IN UNNEST(analyzer_tags)
and the same log line with network_disconnect in the FRA tags.
Found zero matching reports. The "Delay processing ..." logs have almost no correlation with WiFi disconnections.
Revisited Brian's bug and repro.
After reading through the original bug: the pending EAPOL frame is dropped ("Process pending EAPOL")
if the latency of the EAP message is longer than 200ms.
I ran some queries with
system_logs.name = 'netlog' AND REGEXP_CONTAINS(log_lines.content, "Process pending EAPOL") AND "network_connection_failure" IN UNNEST(analyzer_tags)
I did find some usable feedback reports depicting the "Process pending EAPOL" logs close to WiFi disconnections. For example:
2023-02-12T23:50:29.762176Z DEBUG wpa_supplicant[680]: wlan0: Process pending EAPOL frame that was received just before association notification
2023-02-12T23:50:29.762190Z DEBUG wpa_supplicant[680]: wlan0: RX EAPOL from [MAC OUI=a8:9a:93 IFACE=27]
...
2023-02-12T23:50:38.800219Z DEBUG wpa_supplicant[680]: WPA: Derived Key MIC - hexdump(len=16): 7d 15 09 eb b3 85 6e e0 24 d2 ff 11 4f c1 a5 27
2023-02-12T23:50:39.762401Z NOTICE wpa_supplicant[680]: wlan0: Authentication with [MAC OUI=a8:9a:93 IFACE=27] timed out.
2023-02-09T09:42:20.906586Z DEBUG wpa_supplicant[692]: wlan0: Process pending EAPOL frame that was received just before association notification
2023-02-09T09:42:20.906600Z DEBUG wpa_supplicant[692]: wlan0: RX EAPOL from [MAC OUI=80:2a:a8 IFACE=25]
...
2023-02-09T09:42:24.922652Z DEBUG wpa_supplicant[692]: nl80211: Delete station [MAC OUI=80:2a:a8 IFACE=25]
2023-02-09T09:42:24.929000Z DEBUG wpa_supplicant[692]: nl80211: Drv Event 39 (NL80211_CMD_DEAUTHENTICATE) received for wlan0
2023-02-09T09:42:24.929025Z DEBUG wpa_supplicant[692]: nl80211: Deauthenticate event
2023-02-09T09:42:24.929050Z DEBUG wpa_supplicant[692]: wlan0: Event DEAUTH (11) received
2023-02-09T09:42:24.929080Z DEBUG wpa_supplicant[692]: wlan0: Deauthentication notification
2023-02-09T09:42:24.929119Z DEBUG wpa_supplicant[692]: wlan0: * reason 2 (PREV_AUTH_NOT_VALID)
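The kind of timestamp correlation done by hand in the excerpts above (time between a "Process pending EAPOL" line and a subsequent failure line) can be sketched as a small script. This is a hypothetical helper, not an FRA or Xamine feature; the failure signatures and 30-second window are assumptions for illustration.

```python
from datetime import datetime

PENDING = "Process pending EAPOL frame"
# Assumed failure signatures, taken from the log excerpts above.
FAILURES = ("timed out", "Deauthentication notification")

def parse_ts(line):
    # Lines start with an RFC 3339 timestamp, e.g. 2023-02-12T23:50:29.762176Z
    stamp = line.split()[0]
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))

def pending_to_failure_gaps(lines, window_seconds=30.0):
    """Yield seconds between a pending-EAPOL log and the next failure log."""
    pending_ts = None
    for line in lines:
        if PENDING in line:
            pending_ts = parse_ts(line)
        elif pending_ts is not None and any(f in line for f in FAILURES):
            gap = (parse_ts(line) - pending_ts).total_seconds()
            if gap <= window_seconds:
                yield gap
            pending_ts = None
```

Run against the first excerpt, it reports the ~10-second gap between the pending-EAPOL line and the authentication timeout.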
Reconsidering the priority
We need to use Xamine more to determine the priority.
ma...@google.com <ma...@google.com> #31
Nice, thanks Jun!
ma...@google.com <ma...@google.com> #32
Removing triage because it seems like we're waiting on priority assessment
ku...@google.com <ku...@google.com> #33
<Triage notes> Jun to investigate the data to evaluate the priority.
ju...@google.com <ju...@google.com> #34
I re-ran the unioned query mentioned in
system_logs.name = 'netlog' AND (REGEXP_CONTAINS(log_lines.content, "Process pending EAPOL") OR REGEXP_CONTAINS(log_lines.content, "Delay processing of received EAPOL frame")) AND "network_connection_failure" IN UNNEST(analyzer_tags)
and found that the latest feedback report matching the above query is dated 2023-02-13, which is two months ago. While I can search the same error messages in
By looking at the code, the message
Regarding the FRs filed two months ago with the message "Process pending EAPOL" and a network_connection_failure FR tag, for example:
2023-02-10 13:34:11.666 5 wpa_supplicant[843]: BSSID [MAC OUI=10:06:ed IFACE=13] ignore list count incremented to 2, ignoring for 10 seconds
2023-02-10 13:34:11.667 6 shill[1045]: INFO shill: [wifi.cc(1162)] WiFi wlan0 supplicant updated DisconnectReason to 15 (4-Way Handshake timeout)
...
2023-02-10 13:34:12.424 7 wpa_supplicant[843]: wlan0: Process pending EAPOL frame that was received just before association notification
2023-02-10 13:34:12.424 7 wpa_supplicant[843]: wlan0: RX EAPOL from [MAC OUI=10:06:ed IFACE=2]
In conclusion: as of today, I found no evidence that "Process pending EAPOL" is related to the likelihood of connection failure. And since the FRs don't show this message coinciding with connection failures, my suggestion is to close this bug as obsolete.
PS: If FRA and Xamine could compare log timestamps with the start/end times of FRA events, that would be helpful for deeper analysis.
ma...@google.com <ma...@google.com> #35
Can you instead directly write some FRA queries rather than using analyzer tags? I'm not convinced analyzer tags are giving you the best signal. The second report you linked in
ma...@google.com <ma...@google.com> #36
This bug is 3 years old and it's unclear if there's meaningful impact. Unassigning and backlogging.
Description
Forked off this (and other comments from that bug):
"The "pending EAPOL" handling only lasts for 100ms -- after that, wpa_supplicant will just discard the EAPOL. The customer logs show a delay of approx 160ms. That would be enough for us to (silently, unfortunately) drop the pending packet, apparently."
---
The TL;DR: it's possible for wpa_supplicant to receive the initiation of a 4-way handshake (EAPOL, a data packet) before it processes the nl80211 message saying that the network is connected (associated). This causes it to log this message and postpone the handshake:
DEBUG wpa_supplicant[721]: wlan0: Not associated - Delay processing of received EAPOL frame (state=ASSOCIATING bssid=00:00:00:00:00:00)
Later, there's an arbitrary time threshold -- if we don't get around to handling this handshake within XXX milliseconds, we totally drop it. This happens to cause interop problems with certain APs (e.g., smartphone hotspots) which don't retry the EAPOL initiation.
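The defer-then-drop behavior described above can be sketched as follows. This is a simplified Python sketch of the logic, not wpa_supplicant's actual C implementation; the class and method names are illustrative, and the 100ms window is the pre-patch value from the quoted comment.

```python
PENDING_EAPOL_WINDOW_MS = 100  # pending-EAPOL hold time (pre-patch value)

class SupplicantSketch:
    def __init__(self):
        self.associated = False
        self.pending_eapol = None  # (frame, rx_time_ms)

    def on_eapol_rx(self, frame, now_ms):
        if self.associated:
            return self.process_handshake(frame)
        # Not associated yet: hold the frame, hoping the assoc event lands soon.
        self.pending_eapol = (frame, now_ms)
        return "deferred"

    def on_assoc_event(self, now_ms):
        self.associated = True
        if self.pending_eapol is None:
            return None
        frame, rx_time_ms = self.pending_eapol
        self.pending_eapol = None
        if now_ms - rx_time_ms > PENDING_EAPOL_WINDOW_MS:
            # The arbitrary threshold: silently drop the stale frame. If the
            # AP never retransmits the EAPOL initiation, the handshake times out.
            return "dropped"
        return self.process_handshake(frame)

    def process_handshake(self, frame):
        return "processed"
```

With the ~160ms skew seen in the customer logs, the pending frame exceeds the 100ms window and is silently dropped, which is exactly the interop failure with APs that don't retry.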
Patches like this improve the situation:
but they're not foolproof. There are definitely still occasions where we see more than a 200ms skew.
---
Filing this bug to track coming up with a better (more reliable) solution than a time-based drop mechanism. The original issue is resolved, but it'd be good to see if we can do better.