Comments
we...@google.com <we...@google.com> #2
di...@google.com <di...@google.com> #3
we...@google.com <we...@google.com> #4
we...@google.com <we...@google.com> #5
we...@google.com <we...@google.com> #6
di...@google.com <di...@google.com> #7
Submitted again, as new issue here:
we...@google.com <we...@google.com> #8
What I found most interesting was that just enabling the ARM64_PSEUDO_NMI compile time option but not the runtime option gave some performance improvement.
Maybe we could enable the Kconfig option for now and wait a while before applying the rest and turning on the runtime option?
sw...@google.com <sw...@google.com> #9
What I found most interesting was that just enabling the ARM64_PSEUDO_NMI compile time option but not the runtime option gave some performance improvement.
Is there any explanation for that? It would be good to understand why it gives an improvement so that we don't inadvertently disable some cpu vulnerability mitigation that we need enabled. If it truly is an improvement then I wonder why whatever the improvement is isn't always available.
di...@google.com <di...@google.com> #10
If it truly is an improvement then I wonder why whatever the improvement is isn't always available.
It does seem weird. I think the only difference here is that it adds some static branches which I guess evaluate to a nop unless the command line parameter is provided? So the only way we'd get a performance improvement is if somehow these extra nops caused a speedup due to some weird interactions, maybe with cache lines?
I could certainly understand if irqchip.gicv3_pseudo_nmi=1 somehow caused a performance improvement. It seems plausible that masking using a different mechanism could be faster in some systems and slower in others just like how on some systems it might be faster to write "y = x * 3" and on others "y = (x << 1) + x". ...but that's not what you're seeing here...
we...@google.com <we...@google.com> #11
camera_test_min_buffers_0.log is the full log from the camera test suite with crrev/c/3168489 applied.
camera_test_min_buffers_n.log is the full log for the run without the CL applied.
ap...@google.com <ap...@google.com> #12
Branch: chromeos-5.15
commit 399e5d4e576b8bfdc0939b6e62694bbe65b5a9f9
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 17 17:25:44 2023
FIXUP: CHROMIUM: hardlockup: detect hard lockups without NMIs using secondary cpus (all backtrace)
The buddy hard lockup detector should try backtracing on all
CPUs. Right now it doesn't. Copy that bit of logic from the normal
hardlockup detector.
NOTE: On arm64 (the current user of the buddy detector), this won't
(yet) do anything. Soon, hopefully.
UPSTREAM-TASK=b:172213097
BUG=b:278598383, b:278594093, b:197061987, b:172213097
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: Id225408348d8a45e68080d08139bc6d9e170000a
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
M kernel/watchdog_buddy_cpu.c
ap...@google.com <ap...@google.com> #13
Branch: chromeos-5.15
commit 71986679fe52d94286f9051f09b958ecf582c7fc
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 17 17:15:31 2023
FIXUP: CHROMIUM: hardlockup: detect hard lockups without NMIs using secondary cpus (sysctl)
The CHROMIUM patch accidentally didn't expose the hardlockup panic
sysctls based on the right config. Fix it.
NOTE: Only one of these two sysctls actually does something with the
current buddy detector. You can turn on/off the hard lockup detector
but it doesn't (yet) support tracing other CPUs.
UPSTREAM-TASK=b:172213097
BUG=b:278598383, b:278594093, b:197061987, b:172213097
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: Id88d1fb603308e7210c30e42bb6e4e6a4be65a0c
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
M kernel/sysctl.c
ap...@google.com <ap...@google.com> #14
Branch: chromeos-6.1
commit e21e2990b1d7fbb917a7b37541f91f41670d0d1d
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 17 17:25:44 2023
FIXUP: CHROMIUM: hardlockup: detect hard lockups without NMIs using secondary cpus (all backtrace)
The buddy hard lockup detector should try backtracing on all
CPUs. Right now it doesn't. Copy that bit of logic from the normal
hardlockup detector.
NOTE: On arm64 (the current user of the buddy detector), this won't
(yet) do anything. Soon, hopefully.
UPSTREAM-TASK=b:172213097
BUG=b:278598383, b:278594093, b:197061987, b:172213097
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: Id225408348d8a45e68080d08139bc6d9e170000a
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
(cherry picked from commit 399e5d4e576b8bfdc0939b6e62694bbe65b5a9f9)
Reviewed-on:
Bot-Commit: Rubber Stamper <rubber-stamper@appspot.gserviceaccount.com>
M kernel/watchdog_buddy_cpu.c
ap...@google.com <ap...@google.com> #15
Branch: chromeos-6.1
commit 0351e3dbd6dce50087d5b4cb4698c7ceacb93bfa
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 17 17:15:31 2023
FIXUP: CHROMIUM: hardlockup: detect hard lockups without NMIs using secondary cpus (sysctl)
The CHROMIUM patch accidentally didn't expose the hardlockup panic
sysctls based on the right config. Fix it.
NOTE: Only one of these two sysctls actually does something with the
current buddy detector. You can turn on/off the hard lockup detector
but it doesn't (yet) support tracing other CPUs.
UPSTREAM-TASK=b:172213097
BUG=b:278598383, b:278594093, b:197061987, b:172213097
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: Id88d1fb603308e7210c30e42bb6e4e6a4be65a0c
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-on:
M kernel/watchdog.c
di...@google.com <di...@google.com> #16
I'm looking at seeing if I can send out a v8 of Sumit's series. Maybe we can drum up support, or (if not) I guess we could think about landing FROMLIST while we wait...
we...@google.com <we...@google.com> #17
di...@google.com <di...@google.com> #18
Cool! There's also something [1] from MediaTek, though I haven't gone through it and am not sure how they might be connected.
Ah, thanks for the pointer! Looks like Mediatek's patch series is a continuation of the perf-based hard lockup detector. I believe that would work in tandem with the ability to get NMI backtraces. If we actually managed to upstream the buddy lockup detector in
we...@google.com <we...@google.com> #19
There's also
di...@google.com <di...@google.com> #20
There's also
regarding the partitioned PMU interrupts not usable for NMIs b/191948966
I presume the issues that were preventing this are fixed upstream, since the Mediatek patch series seems to re-land the previously-reverted "arm64: Enable perf events based hard lockup detector"
In any case, I've posted this at:
...now we can see what people think! ;-) I included a summary in the cover letter trying to reconcile the various related series...
we...@google.com <we...@google.com> #21
On Juniper running mainline, just enabling pseudo-NMIs causes a bad lockup:
[ 3.047997] ================================
[ 3.047998] WARNING: inconsistent lock state
[ 3.048000] 6.3.0-next-20230428-14437-g7728edc2096a-dirty #628 Not tainted
[ 3.048002] --------------------------------
[ 3.048003] inconsistent {INITIAL USE} -> {IN-NMI} usage.
[ 3.048005] swapper/4/0 [HC1[1]:SC0[0]:HE0:SE1] takes:
[ 3.048009] ffffff80ff52c358 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x2c/0x50
[ 3.048027] {INITIAL USE} state was registered at:
[ 3.048028] lock_acquire+0x1f4/0x340
[ 3.048034] _raw_spin_lock_nested+0x4c/0x70
[ 3.048039] raw_spin_rq_lock_nested+0x2c/0x50
[ 3.048043] rq_attach_root+0x48/0x240
[ 3.048045] sched_init+0x4d8/0x5e8
[ 3.048051] start_kernel+0x410/0x908
[ 3.048054] __primary_switched+0xbc/0xd0
[ 3.048059] irq event stamp: 5004
[ 3.048060] hardirqs last enabled at (5003): [<ffffffeb96a45164>] psci_cpu_suspend_enter+0x19c/0x1f0
[ 3.048066] hardirqs last disabled at (5004): [<ffffffeb96a45164>] psci_cpu_suspend_enter+0x19c/0x1f0
[ 3.048070] softirqs last enabled at (4844): [<ffffffeb960107ec>] __do_softirq+0x43c/0x548
[ 3.048074] softirqs last disabled at (4833): [<ffffffeb96017a48>] ____do_softirq+0x18/0x30
[ 3.048078]
[ 3.048078] other info that might help us debug this:
[ 3.048079] Possible unsafe locking scenario:
[ 3.048079]
[ 3.048080] CPU0
[ 3.048080] ----
[ 3.048081] lock(&rq->__lock);
[ 3.048083] <Interrupt>
[ 3.048084] lock(&rq->__lock);
[ 3.048086]
[ 3.048086] *** DEADLOCK ***
[ 3.048086]
[ 3.048086] 1 lock held by swapper/4/0:
[ 3.048088] #0: ffffffeb9792f000 (rcu_read_lock){....}-{1:2}, at: cpu_pm_notify+0x8/0x140
[ 3.048097]
[ 3.048097] stack backtrace:
[ 3.048099] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 6.3.0-next-20230428-14437-g7728edc2096a-dirty #628 cfa3ce05a32b511c7b1da71d960a9262f60f029b
[ 3.048103] Hardware name: Google juniper sku16 board (DT)
[ 3.048105] Call trace:
[ 3.048106] dump_backtrace+0xa0/0x100
[ 3.048109] show_stack+0x20/0x38
[ 3.048112] dump_stack_lvl+0xdc/0x148
[ 3.048116] dump_stack+0x18/0x28
[ 3.048118] print_usage_bug.part.0+0x290/0x358
[ 3.048122] lock_acquire+0x2f4/0x340
[ 3.048125] _raw_spin_lock_nested+0x4c/0x70
[ 3.048127] raw_spin_rq_lock_nested+0x2c/0x50
[ 3.048131] sched_ttwu_pending+0x78/0x248
[ 3.048133] __flush_smp_call_function_queue+0xe4/0x2e8
[ 3.048139] generic_smp_call_function_single_interrupt+0x1c/0x30
[ 3.048142] ipi_handler+0x224/0x240
[ 3.048147] handle_percpu_devid_irq+0x94/0x158
[ 3.048151] generic_handle_domain_nmi+0x44/0x70
[ 3.048156] __gic_handle_nmi.constprop.0+0x58/0xb0
[ 3.048161] gic_handle_irq+0x284/0x2c8
[ 3.048164] call_on_irq_stack+0x24/0x58
[ 3.048167] do_interrupt_handler+0x88/0x98
[ 3.048171] el1_interrupt+0xb0/0xe8
[ 3.048174] el1h_64_irq_handler+0x18/0x28
[ 3.048177] el1h_64_irq+0x7c/0x80
[ 3.048179] gic_cpu_sys_reg_init+0x108/0x2b8
[ 3.048183] gic_cpu_pm_notifier+0x78/0xa8
[ 3.048186] notifier_call_chain+0xac/0x188
[ 3.048189] raw_notifier_call_chain+0x20/0x38
[ 3.048191] cpu_pm_notify+0x68/0x140
[ 3.048194] cpu_pm_exit+0x1c/0x30
[ 3.048196] psci_enter_idle_state+0x40/0x78
[ 3.048200] cpuidle_enter_state+0xe8/0x5a0
[ 3.048203] cpuidle_enter+0x40/0x60
[ 3.048208] do_idle+0x284/0x308
[ 3.048210] cpu_startup_entry+0x2c/0x40
[ 3.048213] secondary_start_kernel+0x160/0x1c8
[ 3.048217] __secondary_switched+0xb8/0xc0
[ 3.048373] hub 1-1:1.0: USB hub found
[ 3.048859] ------------[ cut here ]------------
[ 3.048861] WARNING: CPU: 4 PID: 93 at arch/arm64/include/asm/irqflags.h:70 _raw_spin_lock_irq+0xb4/0xe8
[ 3.048867] Modules linked in:
[ 3.048870] CPU: 4 PID: 93 Comm: kworker/4:2 Not tainted 6.3.0-next-20230428-14437-g7728edc2096a-dirty #628 cfa3ce05a32b511c7b1da71d960a9262f60f029b
[ 3.048874] Hardware name: Google juniper sku16 board (DT)
[ 3.048877] Workqueue: usb_hub_wq hub_event
[ 3.048882] pstate: 204000c5 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 3.048885] pc : _raw_spin_lock_irq+0xb4/0xe8
[ 3.048888] lr : __hrtimer_run_queues+0x1d0/0x3f0
[ 3.048891] sp : ffffffc008023e50
[ 3.048893] pmr_save: 000000f0
[ 3.048894] x29: ffffffc008023e50 x28: ffffff80ff51d998 x27: ffffffeb97826d5c
[ 3.048898] x26: ffffff80ff51d080 x25: ffffffeb9744d008 x24: 0000000000000001
[ 3.048902] x23: 0000000000000000 x22: ffffff80ff51d110 x21: ffffffeb97823b60
[ 3.048906] x20: ffffffeb9618db30 x19: ffffff80ff51d080 x18: ffffffffffffffff
[ 3.048910] x17: ffffff95680ca000 x16: ffffffc008020000 x15: ffffffffffffffff
[ 3.048913] x14: 0000000000000001 x13: fffffffffff88807 x12: ffffffffffffffff
[ 3.048917] x11: 0000000000000040 x10: ffffff80c0013d58 x9 : ffffffeb9618db30
[ 3.048921] x8 : 00000000a0b1b200 x7 : 7fffffffffffffff x6 : 00000000a0b1b200
[ 3.048924] x5 : 00000000a0b1b200 x4 : ffffff80ff51d998 x3 : 00000000a0c0f440
[ 3.048928] x2 : 00000000000000a0 x1 : 00000000000000f0 x0 : 0000000000000001
[ 3.048931] Call trace:
[ 3.048932] _raw_spin_lock_irq+0xb4/0xe8
[ 3.048935] __hrtimer_run_queues+0x1d0/0x3f0
[ 3.048937] hrtimer_interrupt+0xf0/0x258
[ 3.048940] arch_timer_handler_phys+0x34/0x58
[ 3.048944] handle_percpu_devid_irq+0x94/0x158
[ 3.048947] generic_handle_domain_nmi+0x44/0x70
[ 3.048951] __gic_handle_nmi.constprop.0+0x58/0xb0
[ 3.048954] gic_handle_irq+0x284/0x2c8
[ 3.048956] call_on_irq_stack+0x24/0x58
[ 3.048959] do_interrupt_handler+0x88/0x98
[ 3.048962] el1_interrupt+0xb0/0xe8
[ 3.048965] el1h_64_irq_handler+0x18/0x28
[ 3.048968] el1h_64_irq+0x7c/0x80
[ 3.048970] vprintk_emit+0x2b0/0x3e0
[ 3.048974] dev_vprintk_emit+0xe0/0x1b0
[ 3.048979] dev_printk_emit+0x60/0x90
[ 3.048982] __dev_printk+0x44/0x98
[ 3.048985] _dev_info+0x68/0x98
[ 3.048988] hub_probe+0x13c/0xa18
[ 3.048990] usb_probe_interface+0xdc/0x290
[ 3.048995] really_probe+0x150/0x2c0
[ 3.048998] __driver_probe_device+0x80/0x140
[ 3.049001] driver_probe_device+0xe0/0x170
[ 3.049003] __device_attach_driver+0xc0/0x148
[ 3.049005] bus_for_each_drv+0x88/0xf0
[ 3.049009] __device_attach+0xa4/0x198
[ 3.049012] device_initial_probe+0x1c/0x30
[ 3.049014] bus_probe_device+0xb4/0xc0
[ 3.049016] device_add+0x5a8/0x778
[ 3.049019] usb_set_configuration+0x50c/0x8d0
[ 3.049022] usb_generic_driver_probe+0x78/0xd0
[ 3.049026] usb_probe_device+0x88/0x108
[ 3.049029] really_probe+0x150/0x2c0
[ 3.049031] __driver_probe_device+0x80/0x140
[ 3.049033] driver_probe_device+0xe0/0x170
[ 3.049035] __device_attach_driver+0xc0/0x148
[ 3.049037] bus_for_each_drv+0x88/0xf0
[ 3.049041] __device_attach+0xa4/0x198
[ 3.049043] device_initial_probe+0x1c/0x30
[ 3.049045] bus_probe_device+0xb4/0xc0
[ 3.049047] device_add+0x5a8/0x778
[ 3.049050] usb_new_device+0x24c/0x4b8
[ 3.049052] hub_event+0xedc/0x15f8
[ 3.049054] process_one_work+0x2c0/0x5b8
[ 3.049059] worker_thread+0x7c/0x458
[ 3.049063] kthread+0x128/0x138
[ 3.049066] ret_from_fork+0x10/0x20
[ 3.049070] irq event stamp: 1860
[ 3.049071] hardirqs last enabled at (1859): [<ffffffeb96db5d18>] _raw_spin_unlock_irq+0x38/0xb0
[ 3.049074] hardirqs last disabled at (1860): [<ffffffeb96dacd48>] __schedule+0x4c8/0xd30
[ 3.049078] softirqs last enabled at (1730): [<ffffffeb960107ec>] __do_softirq+0x43c/0x548
[ 3.049081] softirqs last disabled at (1725): [<ffffffeb96017a48>] ____do_softirq+0x18/0x30
[ 3.049084] ---[ end trace 0000000000000000 ]---
di...@google.com <di...@google.com> #22
On Juniper running mainline, just enabling pseudo-NMIs causes a bad lockup:
That's super weird. So far it's been working fine for me on trogdor, but I'm not on kernel-next. I'll try to take a peek at this tomorrow or early next week and see if anything makes sense.
di...@google.com <di...@google.com> #23
FWIW, I tried putting 6.3.0-next-20230505 on a trogdor device (even compiled with KASAN and lockdep) and I didn't see the same problem.
I then tried on kukui-kodama and I reproduced the same problem you did. Problem also reproduces on Linus's tree (v6.1-45701-g7163a2111f6c). I went back to vanilla kernel 5.15 and I still see the problem, so it's not new. Then I went back to vanilla v5.13 (plus crrev.com/c/3522894 and crrev.com/c/3522895 to make it compile) and I reproduced it there too. v5.13 is the first place kodama was supported upstream.
...I then tried chromeos-5.10 and tried turning on the option. It's also broken there on kukui.
Unless we really think this used to work, bisect doesn't seem like it'll help.
di...@google.com <di...@google.com> #24
I've been poking at this a bunch, but it's pretty unfamiliar code to me. While researching, I found that Marc seems to have fixed a bunch of similar-sounding issues in the past, so adding him to CC on the off chance that he wants to offer any insight or has ideas of things to try. ;-)
As far as I can tell, the problem here is that somehow the system has decided that the IPI handler for calling a function on another CPU should be running as a NMI. That's obviously not right.
di...@google.com <di...@google.com> #25
FWIW:
- CONFIG_ARM64_DEBUG_PRIORITY_MASKING=y didn't show anything.
- I don't know if it matters, but I did notice one difference between kukui (where things are broken) and trogdor (where things work):
kukui (doesn't work):
[ 0.000000] GICv3: DOUG: has_group0(): 1, gic_dist_security_disabled(): 0
trogdor (works):
[ 0.000000] GICv3: DOUG: has_group0(): 0, gic_dist_security_disabled(): 0
Could the "group0" difference be what's mattering here? I tried hacking has_group0() to false on kukui and kukui could boot, but pseudo-NMI didn't work...
di...@google.com <di...@google.com> #26
FWIW, kevin (rk3399) matches kukui for at least those two properties, but works:
[ 0.000000] GICv3: DOUG: has_group0(): 1, gic_dist_security_disabled(): 0
...so at least it's not just some generic problem with that config. Back to digging into what's different with kukui.
di...@google.com <di...@google.com> #27
OK, so it looks as if somehow the GIC is losing state when CPUs go idle and that's messing things up. Evidence:
- I can "fix" things by just commenting out the cpu-idle-states in the device tree and then I don't get the crash and I also can use pseudo-NMI.
- I dropped into kgdb the moment I first got the unexpected pseudo-NMI. When I did, I found that some of the processor's GIC configuration had been reset to 0.
It looks like the GIC already gets involved when CPUs come out of idle, so something seems like it's wrong with gic_cpu_sys_reg_init(). Maybe it's somehow not re-initting things properly. I'll keep digging...
di...@google.com <di...@google.com> #28
OK, so
I ran out of time today to do full testing, but I believe
we...@google.com <we...@google.com> #29
Maybe on Rockchip the GIC never loses power, and thus retains its register state?
mz...@google.com <mz...@google.com> #30
This really looks like a firmware issue. On RK3399, ATF is in charge of save/restoring the GIC state across PM events.
Hacking the kernel to save/restore priorities (or any other GIC distributor state) for that is unlikely to result in something that really works, as the kernel running in non-secure cannot configure the state of secure interrupts.
I have pushed back on this in the past, and will continue to do so.
we...@google.com <we...@google.com> #31
A quick skim through ATF seems to indicate that MediaTek ATF is not saving/restoring the state of the redistributor: gicv3_rdistif_save() and gicv3_rdistif_init_restore() are never called.
mz...@google.com <mz...@google.com> #32
That alone would screw up all SGIs and PPIs. Good stuff... :-/
yi...@google.com <yi...@google.com> #33
MTK saves/restores the state of the redistributor using their own implementation (mt_gic_rdistif_save() and mt_gic_rdistif_restore()).
@jason-ch, could you ask the MTK GIC owner to take a look at this issue?
di...@google.com <di...@google.com> #34
FWIW: I've forked the ATF issue to the (also public)
Hacking the kernel to save/restore priorities (or any other GIC distributor state) for that is unlikely to result in something that really works [...]
Marc: it sounds as if the Mediatek ATF is saving / restoring most of the GIC state, but just not these priorities. Presumably that means that it's configuring any state of secure interrupts that are important and we're just missing the priorities. Given that the kernel was the one that sets these priorities in the first place, it must have access to them, right? Would you object to a quirk to work around this Mediatek issue? The problem is that even if we fix the Mediatek ATF implementation, anyone running with old firmware would continue to have this issue. At the very least, recovery images would need to continue to work and those always boot from read-only (non-updatable) firmware.
I'd also be interested if anyone has suggestions for how we should detect this issue. The kernel needs to be able to detect this so that it can either enable the workaround or disable pseudo-NMI. Certainly we could go the "dt" route, but it feels more reliable if we could query ATF directly. Maybe some PSCI call that tells us that we're on Mediatek today, and then we query some other function that currently returns 0. Then, when ATF is fixed we make that function return 1? I'll do a little digging to see if I see anything.
mz...@google.com <mz...@google.com> #35
[I don't know which bug to follow anymore]
The minute I open the door to one such thing, I'll end up with 20 variations of the same stuff, such as this horror:
So no, I'm not taking save/restore code for anything that is the responsibility of ATF and for which the architecture provides the adequate level of support (and when it doesn't, I'm happy to oblige, such as the collection replay on GIC500).
As for filtering the enablement, SOC_ID would be one way out, but nobody implements it. I'm not entertaining SOC-specific SMC calls, as if the vendor can be bothered to implement such a call, they can also implement SOC_ID. If you want to disable it for a whole SoC family, that's fine by me.
But it seems to me that you know exactly what platform you're running on (each board family has a separate kernel?), and you get to construct the command-line that enables it. Why not avoid it on the affected platforms, since they are unusable with this feature?
di...@google.com <di...@google.com> #36
[I don't know which bug to follow anymore]
I was hoping to move the MTK issue in general to the other bug, but oh well. Then this bug could remain about enabling pseudo-NMIs in general, which is what it was originally about. :-P
So no, I'm not taking save/restore code for anything that is the responsibility of ATF and for which the architecture provides the adequate level of support
OK, fair enough. I'll focus on just disabling pseudo-NMIs if I detect that they're broken.
As for filtering the enablement, SOC_ID would be one way out, but nobody implements it. I'm not entertaining SOC-specific SMC calls, as if the vendor can be bothered to implement such a call, they can also implement SOC_ID. If you want to disable it for a whole SoC family, that's fine by me.
After spending a little time on it, I think maybe just using DT and avoiding the SMC calls is the right way to go, but I'll see if I still agree when I get the full prototype, and then I guess see what folks on the list say.
But it seems to me that you know exactly what platform you're running on (each board family has a separate kernel?), and you get to construct the command-line that enables it. Why not avoid it on the affected platforms, since they are unusable with this feature?
The idea would be to actually fix the firmware and enable this feature. Then we'd have one build that needs to run on devices that have the old (broken) firmware and the new fixed firmware, so we'd need to detect. Technically we could teach the new firmware to add the kernel command line argument, but that feels a little awkward. It could also cause problems in the future if we ever wanted to make a universal arm64 build that worked on a variety of arm64 boards since we don't want to have to teach every architecture's bootloader to add irqchip.gicv3_pseudo_nmi=1.
di...@google.com <di...@google.com> #37
FWIW: I've posted patches for the kernel to disable pseudo-NMI support on the affected devices. See
ap...@google.com <ap...@google.com> #38
Branch: chromeos-6.1
commit 0b11568653136f0e6871f2382bdd97387c5f9ca5
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon May 15 13:13:52 2023
CHROMIUM: arm64: dts: mediatek: mt8186: Add mediatek,broken-save-restore-fw to corsola
Firmware shipped on mt8186 Chromebooks is affected by the GICR
save/restore issue as described by the patch ("dt-bindings:
interrupt-controller: arm,gic-v3: Add quirk for Mediatek SoCs w/
broken FW"). Add the quirk property.
UPSTREAM-TASK=b:213000788
BUG=b:281831288, b:197061987
TEST=Pseudo NMIs are disabled on MTK devices
Change-Id: Id23d6aa00a713edbf3caf2d8b81a9d7ac43a1d32
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
Reviewed-by: Yidi Lin <yidilin@chromium.org>
M arch/arm64/boot/dts/mediatek/mt8186-corsola.dtsi
ap...@google.com <ap...@google.com> #39
Branch: chromeos-6.1
commit c09b9155424d210977e0daa04eedc00c5915fc77
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon May 15 13:13:53 2023
FROMGIT: arm64: dts: mediatek: mt8192: Add mediatek,broken-save-restore-fw to asurada
Firmware shipped on mt8192 Chromebooks is affected by the GICR
save/restore issue as described by the patch ("dt-bindings:
interrupt-controller: arm,gic-v3: Add quirk for Mediatek SoCs w/
broken FW"). Add the quirk property.
Fixes: 331fae2fc922 ("arm64: dts: mediatek: Introduce MT8192-based Asurada board family")
Reviewed-by: Julius Werner <jwerner@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Link:
Signed-off-by: Matthias Brugger <matthias.bgg@gmail.com>
(cherry picked from commit d72cfbd6fcf7cd02084991eee47ecc9f4b4c1e69
git://
BUG=b:281831288, b:197061987
TEST=Pseudo NMIs are disabled on MTK devices
Change-Id: Ie7e600278ffbed55a1e5a58178203787b1449b35
Reviewed-on:
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
M arch/arm64/boot/dts/mediatek/mt8192-asurada.dtsi
ap...@google.com <ap...@google.com> #40
Branch: chromeos-6.1
commit e081c38913135102b02bbcbfc9bf79046cd9f2a0
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon May 15 13:13:52 2023
FROMGIT: arm64: dts: mediatek: mt8183: Add mediatek,broken-save-restore-fw to kukui
Firmware shipped on mt8183 Chromebooks is affected by the GICR
save/restore issue as described by the patch ("dt-bindings:
interrupt-controller: arm,gic-v3: Add quirk for Mediatek SoCs w/
broken FW"). Add the quirk property.
Fixes: cd894e274b74 ("arm64: dts: mt8183: Add krane-sku176 board")
Reviewed-by: Julius Werner <jwerner@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Link:
Signed-off-by: Matthias Brugger <matthias.bgg@gmail.com>
(cherry picked from commit 42127f578ebde652d1373e0233356fbd351675c4
git://
BUG=b:281831288, b:197061987
TEST=Pseudo NMIs are disabled on MTK devices
Change-Id: I525a2ed4260046d43c885ee1275e91707743df1c
Reviewed-on:
Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
M arch/arm64/boot/dts/mediatek/mt8183-kukui.dtsi
ap...@google.com <ap...@google.com> #41
Branch: chromeos-6.1
commit e271e3e5da8dd6561430411690a5ca4075708259
Author: Marc Zyngier <maz@kernel.org>
Date: Tue May 30 11:01:22 2023
FROMGIT: irqchip/gic: Correctly validate OF quirk descriptors
When checking for OF quirks, make sure either 'compatible' or 'property'
is set, and give up otherwise.
This avoids non-OF quirks being randomly applied as they don't have any
of the OF data that need checking.
Cc: Douglas Anderson <dianders@chromium.org>
Reported-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Fixes: 44bd78dd2b88 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues")
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit 91539341a3b6e9c868024a4292455dae36e6f58c
git://
BUG=b:281831288, b:197061987
TEST=Pseudo NMIs are disabled on MTK devices
Change-Id: I6fcac014ed9cc4f370383bfe59cef16846e8dfa7
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
M drivers/irqchip/irq-gic-common.c
ap...@google.com <ap...@google.com> #42
Branch: chromeos-6.1
commit eb3bc1cde6325cc64dd11e4b3f71456b001e8cc1
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon May 15 13:13:51 2023
UPSTREAM: irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues
Some Chromebooks with Mediatek SoCs have a problem where the firmware
doesn't properly save/restore certain GICR registers. Newer
Chromebooks should fix this issue and we may be able to do firmware
updates for old Chromebooks. At the moment, the only known issue with
these Chromebooks is that we can't enable "pseudo NMIs" since the
priority register can be lost. Enabling "pseudo NMIs" on Chromebooks
with the problematic firmware causes crashes and freezes.
Let's detect devices with this problem and then disable "pseudo NMIs"
on them. We'll detect the problem by looking for the presence of the
"mediatek,broken-save-restore-fw" property in the GIC device tree
node. Any devices with fixed firmware will not have this property.
Our detection plan works because we never bake a Chromebook's device
tree into firmware. Instead, device trees are always bundled with the
kernel. We'll update the device trees of all affected Chromebooks and
then we'll never enable "pseudo NMI" on a kernel that is bundled with
old device trees. When a firmware update is shipped that fixes this
issue it will know to patch the device tree to remove the property.
In order to make this work, the quirk detection mechanism of the GICv3
code is extended to be able to look for properties in addition to
looking at "compatible".
Reviewed-by: Julius Werner <jwerner@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link:
(cherry picked from commit 44bd78dd2b8897f59b7e3963f088caadb7e4f047)
BUG=b:281831288, b:197061987
TEST=Pseudo NMIs are disabled on MTK devices
Cq-Depend: chromium:4575429
Change-Id: I88dc0a0eb1d9d537de61604cd8994ecc55c0cac1
Reviewed-on:
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
M drivers/irqchip/irq-gic-common.c
M drivers/irqchip/irq-gic-common.h
M drivers/irqchip/irq-gic-v3.c
ap...@google.com <ap...@google.com> #43
Branch: chromeos-6.1
commit 650f7a9c07729cc7fabd0c4332cb616cf9b5eae1
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon May 15 13:13:50 2023
UPSTREAM: dt-bindings: interrupt-controller: arm,gic-v3: Add quirk for Mediatek SoCs w/ broken FW
When trying to turn on the "pseudo NMI" kernel feature in Linux, it
was discovered that all Mediatek-based Chromebooks that ever shipped
(at least ones with GICv3) had a firmware bug where they wouldn't save
certain GIC "GICR" registers properly. If a processor ever entered a
suspend/idle mode where the GICR registers lost state then they'd be
reset to their default state.
As a result of the bug, if you try to enable "pseudo NMIs" on the
affected devices then certain interrupts will unexpectedly get
promoted to be "pseudo NMIs" and cause crashes / freezes / general
mayhem.
ChromeOS is looking to start turning on "pseudo NMIs" in production to
make crash reports more actionable. To do so, we will release firmware
updates for at least some of the affected Mediatek Chromebooks.
However, even when we update the firmware of a Chromebook it's always
possible that a user will end up booting with old firmware. We need to
be able to detect when we're running with firmware that will crash and
burn if pseudo NMIs are enabled.
The current plan is:
* Update the device trees of all affected Chromebooks to include the
'mediatek,broken-save-restore-fw' property. The kernel can use this
to know not to enable certain features like "pseudo NMI". NOTE:
device trees for Chromebooks are never baked into the firmware but
are bundled with the kernel. A kernel will never be configured to
use "pseudo NMIs" and be bundled with an old device tree.
* When we get a fixed firmware for one of these Chromebooks, it will
patch the device tree to remove this property.
For some details, you can also see the public bug
<
Reviewed-by: Julius Werner <jwerner@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link:
(cherry picked from commit 43cd3ddbff3c1635d0e09fe5b09af48d39dbb9d7)
BUG=b:281831288, b:197061987
TEST=Pseudo NMIs are disabled on MTK devices
Change-Id: Iabe67a827e206496efec6beb5616d5a3b99c1e65
Reviewed-on:
Reviewed-by: Sean Paul <sean@poorly.run>
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
M Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml
ap...@google.com <ap...@google.com> #44
Branch: chromeos-5.15
commit db42af38816763ecf948c7b8eeac7d6aa4eaaac3
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon May 15 13:13:52 2023
CHROMIUM: arm64: dts: mediatek: mt8186: Add mediatek,broken-save-restore-fw to corsola
Firmware shipped on mt8186 Chromebooks is affected by the GICR
save/restore issue as described by the patch ("dt-bindings:
interrupt-controller: arm,gic-v3: Add quirk for Mediatek SoCs w/
broken FW"). Add the quirk property.
UPSTREAM-TASK=b:213000788
BUG=b:281831288, b:197061987
TEST=Pseudo NMIs are disabled on MTK devices
Change-Id: Id23d6aa00a713edbf3caf2d8b81a9d7ac43a1d32
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
Reviewed-by: Yidi Lin <yidilin@chromium.org>
M arch/arm64/boot/dts/mediatek/mt8186-corsola.dtsi
ap...@google.com <ap...@google.com> #45
Branch: chromeos-5.15
commit d49bac2f6f07dc5f552a250f65cafc985113e03d
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon May 15 13:13:52 2023
FROMGIT: arm64: dts: mediatek: mt8183: Add mediatek,broken-save-restore-fw to kukui
Firmware shipped on mt8183 Chromebooks is affected by the GICR
save/restore issue as described by the patch ("dt-bindings:
interrupt-controller: arm,gic-v3: Add quirk for Mediatek SoCs w/
broken FW"). Add the quirk property.
Fixes: cd894e274b74 ("arm64: dts: mt8183: Add krane-sku176 board")
Reviewed-by: Julius Werner <jwerner@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Link:
Signed-off-by: Matthias Brugger <matthias.bgg@gmail.com>
(cherry picked from commit 42127f578ebde652d1373e0233356fbd351675c4
git://
BUG=b:281831288, b:197061987
TEST=Pseudo NMIs are disabled on MTK devices
Change-Id: I525a2ed4260046d43c885ee1275e91707743df1c
Reviewed-on:
Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
Tested-by: Douglas Anderson <dianders@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
M arch/arm64/boot/dts/mediatek/mt8183-kukui.dtsi
ap...@google.com <ap...@google.com> #46
Branch: chromeos-5.15
commit c0e75cb6122ae787c2110d085473d1381f7cfdaa
Author: Marc Zyngier <maz@kernel.org>
Date: Tue May 30 11:01:22 2023
FROMGIT: irqchip/gic: Correctly validate OF quirk descriptors
When checking for OF quirks, make sure either 'compatible' or 'property'
is set, and give up otherwise.
This avoids non-OF quirks being randomly applied as they don't have any
of the OF data that need checking.
Cc: Douglas Anderson <dianders@chromium.org>
Reported-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Fixes: 44bd78dd2b88 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues")
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit 91539341a3b6e9c868024a4292455dae36e6f58c
git://
BUG=b:281831288, b:197061987
TEST=Pseudo NMIs are disabled on MTK devices
Change-Id: I6fcac014ed9cc4f370383bfe59cef16846e8dfa7
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Sean Paul <sean@poorly.run>
Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
M drivers/irqchip/irq-gic-common.c
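For readers following along, the validation rule described in the commit above ("make sure either 'compatible' or 'property' is set, and give up otherwise") can be sketched in plain C. This is a toy model, not the kernel's code: `struct toy_node`, `struct toy_quirk`, `toy_quirk_matches()` and `toy_count_of_quirks()` are hypothetical names invented for illustration, and a node here carries at most one boolean property.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a device tree node (illustration only). */
struct toy_node {
	const char *compatible;
	const char *property;	/* one optional boolean property, NULL if none */
};

/* Hypothetical quirk descriptor; a table ends with an entry whose desc is NULL. */
struct toy_quirk {
	const char *desc;
	const char *compatible;	/* NULL if the quirk does not match on compatible */
	const char *property;	/* NULL if the quirk does not match on a property */
};

/* Return true only if the quirk carries OF data and all of it matches the node. */
static bool toy_quirk_matches(const struct toy_node *np, const struct toy_quirk *q)
{
	assert(np != NULL && q != NULL);

	/* A descriptor with no OF data is not an OF quirk: give up on it. */
	if (!q->compatible && !q->property)
		return false;
	if (q->compatible &&
	    (!np->compatible || strcmp(np->compatible, q->compatible) != 0))
		return false;
	if (q->property &&
	    (!np->property || strcmp(np->property, q->property) != 0))
		return false;
	return true;
}

/* Count how many quirks in the table would be applied to this node. */
static size_t toy_count_of_quirks(const struct toy_node *np,
				  const struct toy_quirk *quirks)
{
	size_t n = 0;

	for (; quirks->desc; quirks++)
		if (toy_quirk_matches(np, quirks))
			n++;
	return n;
}
```

Without the first check in `toy_quirk_matches()`, a descriptor that has neither field set would fall through both comparisons and match every node, which is the "randomly applied" behaviour the commit message describes.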
ap...@google.com <ap...@google.com> #47
Branch: chromeos-5.15
commit d4fa676412a723561cecb62730538d03e4937271
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon May 15 13:13:51 2023
UPSTREAM: irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues
Some Chromebooks with Mediatek SoCs have a problem where the firmware
doesn't properly save/restore certain GICR registers. Newer
Chromebooks should fix this issue and we may be able to do firmware
updates for old Chromebooks. At the moment, the only known issue with
these Chromebooks is that we can't enable "pseudo NMIs" since the
priority register can be lost. Enabling "pseudo NMIs" on Chromebooks
with the problematic firmware causes crashes and freezes.
Let's detect devices with this problem and then disable "pseudo NMIs"
on them. We'll detect the problem by looking for the presence of the
"mediatek,broken-save-restore-fw" property in the GIC device tree
node. Any devices with fixed firmware will not have this property.
Our detection plan works because we never bake a Chromebook's device
tree into firmware. Instead, device trees are always bundled with the
kernel. We'll update the device trees of all affected Chromebooks and
then we'll never enable "pseudo NMI" on a kernel that is bundled with
old device trees. When a firmware update is shipped that fixes this
issue it will know to patch the device tree to remove the property.
In order to make this work, the quirk detection mechanism of the GICv3
code is extended to be able to look for properties in addition to
looking at "compatible".
Reviewed-by: Julius Werner <jwerner@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link:
(cherry picked from commit 44bd78dd2b8897f59b7e3963f088caadb7e4f047)
BUG=b:281831288, b:197061987
TEST=Pseudo NMIs are disabled on MTK devices
Cq-Depend: chromium:4573419
Change-Id: I88dc0a0eb1d9d537de61604cd8994ecc55c0cac1
Reviewed-on:
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
M drivers/irqchip/irq-gic-common.c
M drivers/irqchip/irq-gic-common.h
M drivers/irqchip/irq-gic-v3.c
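As a rough illustration of the detection plan in the commit above, the decision can be modelled as "pseudo NMIs are allowed only if they were requested and the GIC node lacks the quirk property". The sketch below is a self-contained toy, not kernel code: `struct fake_dt_node` and the `fake_dt_*()` / `pseudo_nmi_allowed()` helpers are hypothetical names, and only the property string `mediatek,broken-save-restore-fw` comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_PROPS 8

/* Hypothetical stand-in for the GIC device tree node: a list of property names. */
struct fake_dt_node {
	const char *props[MAX_PROPS];
	size_t nprops;
};

/* Return true if the node carries a property with the given name. */
static bool fake_dt_has_property(const struct fake_dt_node *node, const char *name)
{
	assert(node != NULL && name != NULL);
	for (size_t i = 0; i < node->nprops; i++) {
		if (strcmp(node->props[i], name) == 0)
			return true;
	}
	return false;
}

/* Model of fixed firmware patching the device tree: drop the property if present. */
static void fake_dt_remove_property(struct fake_dt_node *node, const char *name)
{
	assert(node != NULL && name != NULL);
	for (size_t i = 0; i < node->nprops; i++) {
		if (strcmp(node->props[i], name) == 0) {
			/* Order does not matter here, so move the last entry into the gap. */
			node->props[i] = node->props[node->nprops - 1];
			node->nprops--;
			return;
		}
	}
}

/* Pseudo NMIs are used only when requested and the quirk property is absent. */
static bool pseudo_nmi_allowed(const struct fake_dt_node *gic_node, bool requested)
{
	return requested &&
	       !fake_dt_has_property(gic_node, "mediatek,broken-save-restore-fw");
}
```

With an updated device tree and old firmware the property stays in place and `pseudo_nmi_allowed()` returns false; once fixed firmware removes the property, the same kernel and device tree return true.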
ap...@google.com <ap...@google.com> #48
Branch: chromeos-5.15
commit 11cda9cc4a1ac84346f51125cfcf724632b79a38
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon May 15 13:13:50 2023
UPSTREAM: dt-bindings: interrupt-controller: arm,gic-v3: Add quirk for Mediatek SoCs w/ broken FW
When trying to turn on the "pseudo NMI" kernel feature in Linux, it
was discovered that all Mediatek-based Chromebooks that ever shipped
(at least ones with GICv3) had a firmware bug where they wouldn't save
certain GIC "GICR" registers properly. If a processor ever entered a
suspend/idle mode where the GICR registers lost state then they'd be
reset to their default state.
As a result of the bug, if you try to enable "pseudo NMIs" on the
affected devices then certain interrupts will unexpectedly get
promoted to be "pseudo NMIs" and cause crashes / freezes / general
mayhem.
ChromeOS is looking to start turning on "pseudo NMIs" in production to
make crash reports more actionable. To do so, we will release firmware
updates for at least some of the affected Mediatek Chromebooks.
However, even when we update the firmware of a Chromebook it's always
possible that a user will end up booting with old firmware. We need to
be able to detect when we're running with firmware that will crash and
burn if pseudo NMIs are enabled.
The current plan is:
* Update the device trees of all affected Chromebooks to include the
'mediatek,broken-save-restore-fw' property. The kernel can use this
to know not to enable certain features like "pseudo NMI". NOTE:
device trees for Chromebooks are never baked into the firmware but
are bundled with the kernel. A kernel will never be configured to
use "pseudo NMIs" and be bundled with an old device tree.
* When we get a fixed firmware for one of these Chromebooks, it will
patch the device tree to remove this property.
For some details, you can also see the public bug
<
Reviewed-by: Julius Werner <jwerner@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link:
(cherry picked from commit 43cd3ddbff3c1635d0e09fe5b09af48d39dbb9d7)
BUG=b:281831288, b:197061987
TEST=Pseudo NMIs are disabled on MTK devices
Change-Id: Iabe67a827e206496efec6beb5616d5a3b99c1e65
Reviewed-on:
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
M Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml
ap...@google.com <ap...@google.com> #49
Branch: chromeos-5.10
commit b3d9986e39ae807a2f70e324065a87c39745e431
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon May 15 13:13:52 2023
FROMGIT: arm64: dts: mediatek: mt8183: Add mediatek,broken-save-restore-fw to kukui
Firmware shipped on mt8183 Chromebooks is affected by the GICR
save/restore issue as described by the patch ("dt-bindings:
interrupt-controller: arm,gic-v3: Add quirk for Mediatek SoCs w/
broken FW"). Add the quirk property.
Fixes: cd894e274b74 ("arm64: dts: mt8183: Add krane-sku176 board")
Reviewed-by: Julius Werner <jwerner@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Link:
Signed-off-by: Matthias Brugger <matthias.bgg@gmail.com>
(cherry picked from commit 42127f578ebde652d1373e0233356fbd351675c4
git://
BUG=b:281831288, b:197061987
TEST=Pseudo NMIs are disabled on MTK devices
Change-Id: I525a2ed4260046d43c885ee1275e91707743df1c
Reviewed-on:
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
Tested-by: Douglas Anderson <dianders@chromium.org>
M arch/arm64/boot/dts/mediatek/mt8183-kukui.dtsi
ap...@google.com <ap...@google.com> #50
Branch: chromeos-5.10
commit 6c0be5119c2634d271733ca08efa188134a24208
Author: Marc Zyngier <maz@kernel.org>
Date: Tue May 30 11:01:22 2023
FROMGIT: irqchip/gic: Correctly validate OF quirk descriptors
When checking for OF quirks, make sure either 'compatible' or 'property'
is set, and give up otherwise.
This avoids non-OF quirks being randomly applied as they don't have any
of the OF data that need checking.
Cc: Douglas Anderson <dianders@chromium.org>
Reported-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Fixes: 44bd78dd2b88 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues")
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit 91539341a3b6e9c868024a4292455dae36e6f58c
git://
BUG=b:281831288, b:197061987
TEST=Pseudo NMIs are disabled on MTK devices
Change-Id: I6fcac014ed9cc4f370383bfe59cef16846e8dfa7
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
M drivers/irqchip/irq-gic-common.c
ap...@google.com <ap...@google.com> #51
Branch: chromeos-5.10
commit 825d270c8ffd5ccd787399e0142b86cb869ad31b
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon May 15 13:13:51 2023
UPSTREAM: irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues
Some Chromebooks with Mediatek SoCs have a problem where the firmware
doesn't properly save/restore certain GICR registers. Newer
Chromebooks should fix this issue and we may be able to do firmware
updates for old Chromebooks. At the moment, the only known issue with
these Chromebooks is that we can't enable "pseudo NMIs" since the
priority register can be lost. Enabling "pseudo NMIs" on Chromebooks
with the problematic firmware causes crashes and freezes.
Let's detect devices with this problem and then disable "pseudo NMIs"
on them. We'll detect the problem by looking for the presence of the
"mediatek,broken-save-restore-fw" property in the GIC device tree
node. Any devices with fixed firmware will not have this property.
Our detection plan works because we never bake a Chromebook's device
tree into firmware. Instead, device trees are always bundled with the
kernel. We'll update the device trees of all affected Chromebooks and
then we'll never enable "pseudo NMI" on a kernel that is bundled with
old device trees. When a firmware update is shipped that fixes this
issue it will know to patch the device tree to remove the property.
In order to make this work, the quirk detection mechanism of the GICv3
code is extended to be able to look for properties in addition to
looking at "compatible".
Reviewed-by: Julius Werner <jwerner@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link:
(cherry picked from commit 44bd78dd2b8897f59b7e3963f088caadb7e4f047)
BUG=b:281831288, b:197061987
TEST=Pseudo NMIs are disabled on MTK devices
Cq-Depend: chromium:4575424
Change-Id: I88dc0a0eb1d9d537de61604cd8994ecc55c0cac1
Reviewed-on:
Reviewed-by: Sean Paul <sean@poorly.run>
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
M drivers/irqchip/irq-gic-common.c
M drivers/irqchip/irq-gic-common.h
M drivers/irqchip/irq-gic-v3.c
ap...@google.com <ap...@google.com> #52
Branch: chromeos-5.10
commit bc04f2df0f4e903e7bed8169404f6c6580c722d3
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon May 15 13:13:50 2023
UPSTREAM: dt-bindings: interrupt-controller: arm,gic-v3: Add quirk for Mediatek SoCs w/ broken FW
When trying to turn on the "pseudo NMI" kernel feature in Linux, it
was discovered that all Mediatek-based Chromebooks that ever shipped
(at least ones with GICv3) had a firmware bug where they wouldn't save
certain GIC "GICR" registers properly. If a processor ever entered a
suspend/idle mode where the GICR registers lost state then they'd be
reset to their default state.
As a result of the bug, if you try to enable "pseudo NMIs" on the
affected devices then certain interrupts will unexpectedly get
promoted to be "pseudo NMIs" and cause crashes / freezes / general
mayhem.
ChromeOS is looking to start turning on "pseudo NMIs" in production to
make crash reports more actionable. To do so, we will release firmware
updates for at least some of the affected Mediatek Chromebooks.
However, even when we update the firmware of a Chromebook it's always
possible that a user will end up booting with old firmware. We need to
be able to detect when we're running with firmware that will crash and
burn if pseudo NMIs are enabled.
The current plan is:
* Update the device trees of all affected Chromebooks to include the
'mediatek,broken-save-restore-fw' property. The kernel can use this
to know not to enable certain features like "pseudo NMI". NOTE:
device trees for Chromebooks are never baked into the firmware but
are bundled with the kernel. A kernel will never be configured to
use "pseudo NMIs" and be bundled with an old device tree.
* When we get a fixed firmware for one of these Chromebooks, it will
patch the device tree to remove this property.
For some details, you can also see the public bug
<
Reviewed-by: Julius Werner <jwerner@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link:
(cherry picked from commit 43cd3ddbff3c1635d0e09fe5b09af48d39dbb9d7)
BUG=b:281831288, b:197061987
TEST=Pseudo NMIs are disabled on MTK devices
Change-Id: Iabe67a827e206496efec6beb5616d5a3b99c1e65
Reviewed-on:
Reviewed-by: Sean Paul <sean@poorly.run>
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
M Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml
ap...@google.com <ap...@google.com> #53
Branch: chromeos-5.10
commit e1fbef4c7d1bbb1591016df2b073e7fd98206bdd
Author: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Date: Wed Jun 09 16:51:08 2021
UPSTREAM: dt-bindings: interrupt-controller: arm,gic-v3: Describe GICv3 optional properties
Describe the optional GICv3 properties:
- clocks
- clock-names
- power-domains
- resets
Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link:
(cherry picked from commit 4e08a559a18c1b6424e56859c74adb4b29c17318)
BUG=b:281831288, b:197061987
TEST=Future patches easier to pick
Change-Id: Ic248e25042a769291845ce2bd85e6157352b1140
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Sean Paul <sean@poorly.run>
Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
M Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml
ap...@google.com <ap...@google.com> #54
Branch: chromeos-5.10
commit 42f8fc5b6a8c66c47f24e01aae5862c3b94724b9
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon May 15 13:13:54 2023
FROMGIT: arm64: dts: mediatek: mt8195: Add mediatek,broken-save-restore-fw to cherry
Firmware shipped on mt8195 Chromebooks is affected by the GICR
save/restore issue as described by the patch ("dt-bindings:
interrupt-controller: arm,gic-v3: Add quirk for Mediatek SoCs w/
broken FW"). Add the quirk property.
Fixes: 5eb2e303ec6b ("arm64: dts: mediatek: Introduce MT8195 Cherry platform's Tomato")
Reviewed-by: Julius Werner <jwerner@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Link:
Signed-off-by: Matthias Brugger <matthias.bgg@gmail.com>
(cherry picked from commit ea6c5f21efecbaa3a14cb21c5bc0e23c84473a11
git://
BUG=b:281831288, b:197061987
TEST=Pseudo NMIs are disabled on MTK devices
Change-Id: Ia0b6ebbaa351e3cd67e201355b9ae67783c7d718
Reviewed-on:
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
Reviewed-by: Yidi Lin <yidilin@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
M arch/arm64/boot/dts/mediatek/mt8195-cherry.dtsi
di...@google.com <di...@google.com> #55
Mediatek-related quirks have landed, as can be seen above. I put a plan for getting Mediatek Chromebooks fixed in
Plan for getting this rolled out:
- Land upstream fixes (still need +2):
5.15:
https://crrev.com/c/4575505 - BACKPORT: irqchip/gic-v3: Fix priority mask handling
5.10:
https://crrev.com/c/4575725 - BACKPORT: irqchip/gic-v3: Fix priority mask handling
https://crrev.com/c/4575724 - UPSTREAM: irqchip/gic-v3: Refactor ISB + EOIR at ack time
https://crrev.com/c/4575723 - UPSTREAM: irqchip/gic-v3: Ensure pseudo-NMIs have an ISB between ack and handling
- Figure out how to get the pseudo-NMI backtrace series landed upstream. I'll plan to send out a v9 soon and see if I can drum up any more interest.
- Re-confirm Chen-Yu's earlier tests showing that turning on pseudo-NMI doesn't cause any big performance regressions on any of our hardware.
- Turn on in arm64 config for ChromeOS 5.10, 5.15, and 6.1 and add the magic kernel command line argument to our build scripts.
Then, I think we're done.
ap...@google.com <ap...@google.com> #56
Branch: chromeos-5.15
commit 5d55d7a2ed23d845bece7a043c81b4effdfc8020
Author: Douglas Anderson <dianders@chromium.org>
Date: Tue May 30 13:59:53 2023
BACKPORT: irqchip/gic-v3: Fix priority mask handling
When a kernel is built with CONFIG_ARM64_PSEUDO_NMI=y and pseudo-NMIs
are enabled at runtime, GICv3's gic_handle_irq() can leave DAIF and
ICC_PMR_EL1 in an unexpected state in some cases, breaking subsequent
usage of local_irq_enable() and resulting in softirqs being run with
IRQs erroneously masked (possibly resulting in deadlocks).
This can happen when an IRQ exception is taken from a context where
regular IRQs were unmasked, and either:
(1) ICC_IAR1_EL1 indicates a special INTID (e.g. as a result of an IRQ
being withdrawn since the IRQ exception was taken).
(2) ICC_IAR1_EL1 and ICC_RPR_EL1 indicate an NMI was acknowledged.
When an NMI is taken from a context where regular IRQs were masked,
there is no problem.
When CONFIG_ARM64_DEBUG_PRIORITY_MASKING=y, this can be detected with
perf, e.g.
| # ./perf record -a -g -e cycles:k ls -alR / > /dev/null 2>&1
| ------------[ cut here ]------------
| WARNING: CPU: 0 PID: 14 at arch/arm64/include/asm/irqflags.h:32 arch_local_irq_enable+0x4c/0x6c
| Modules linked in:
| CPU: 0 PID: 14 Comm: ksoftirqd/0 Not tainted 5.18.0-rc5-00004-g876c38e3d20b #12
| Hardware name: linux,dummy-virt (DT)
| pstate: 204000c5 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : arch_local_irq_enable+0x4c/0x6c
| lr : __do_softirq+0x110/0x5d8
| sp : ffff8000080bbbc0
| pmr_save: 000000f0
| x29: ffff8000080bbbc0 x28: ffff316ac3a6ca40 x27: 0000000000000000
| x26: 0000000000000000 x25: ffffa04611c06008 x24: ffffa04611c06008
| x23: 0000000040400005 x22: 0000000000000200 x21: ffff8000080bbe20
| x20: ffffa0460fe10320 x19: 0000000000000009 x18: 0000000000000000
| x17: ffff91252dfa9000 x16: ffff800008004000 x15: 0000000000004000
| x14: 0000000000000028 x13: ffffa0460fe17578 x12: ffffa0460fed4294
| x11: ffffa0460fedc168 x10: ffffffffffffff80 x9 : ffffa0460fe10a70
| x8 : ffffa0460fedc168 x7 : 000000000000b762 x6 : 00000000057c3bdf
| x5 : ffff8000080bbb18 x4 : 0000000000000000 x3 : 0000000000000001
| x2 : ffff91252dfa9000 x1 : 0000000000000060 x0 : 00000000000000f0
| Call trace:
| arch_local_irq_enable+0x4c/0x6c
| __irq_exit_rcu+0x180/0x1ac
| irq_exit_rcu+0x1c/0x44
| el1_interrupt+0x4c/0xe4
| el1h_64_irq_handler+0x18/0x24
| el1h_64_irq+0x74/0x78
| smpboot_thread_fn+0x68/0x2c0
| kthread+0x124/0x130
| ret_from_fork+0x10/0x20
| irq event stamp: 193241
| hardirqs last enabled at (193240): [<ffffa0460fe10a9c>] __do_softirq+0x10c/0x5d8
| hardirqs last disabled at (193241): [<ffffa0461102ffe4>] el1_dbg+0x24/0x90
| softirqs last enabled at (193234): [<ffffa0460fe10e00>] __do_softirq+0x470/0x5d8
| softirqs last disabled at (193239): [<ffffa0460fea9944>] __irq_exit_rcu+0x180/0x1ac
| ---[ end trace 0000000000000000 ]---
The necessary manipulation of DAIF and ICC_PMR_EL1 depends on the
interrupted context, but the structure of gic_handle_irq() makes this
also depend on whether the GIC reports an IRQ, NMI, or special INTID:
* When the interrupted context had regular IRQs masked (and hence the
interrupt must be an NMI), the entry code performs the NMI
entry/exit and gic_handle_irq() should return with DAIF and
ICC_PMR_EL1 unchanged.
This is handled correctly today.
* When the interrupted context had regular IRQs unmasked, the entry code
performs IRQ entry/exit, but expects gic_handle_irq() to always update
ICC_PMR_EL1 and DAIF.IF to unmask NMIs (but not regular IRQs) prior to
returning (which it must do prior to invoking any regular IRQ
handler).
This unbalanced calling convention is necessary because we don't know
whether an NMI has been taken until acknowledged by a read from
ICC_IAR1_EL1, and so we need to perform the read with NMI masked in
case an NMI has been taken (and needs to be handled with NMIs masked).
Unfortunately, this is not handled consistently:
- When ICC_IAR1_EL1 reports a special INTID, gic_handle_irq() returns
immediately without manipulating ICC_PMR_EL1 and DAIF.
- When ICC_RPR_EL1 indicates an NMI, gic_handle_irq() calls
gic_handle_nmi() to invoke the NMI handler, then returns without
manipulating ICC_PMR_EL1 and DAIF.
- For regular IRQs, gic_handle_irq() manipulates ICC_PMR_EL1 and DAIF
prior to invoking the IRQ handler.
There were related problems with special INTID handling in the past,
where if an exception was taken from a context with regular IRQs masked
and ICC_IAR1_EL1 reported a special INTID, gic_handle_irq() would
erroneously unmask NMIs in NMI context, permitting an unexpected nested
NMI. That case specifically was fixed by commit:
a97709f563a078e2 ("irqchip/gic-v3: Do not enable irqs when handling spurious interrups")
... but unfortunately that commit added an inverse problem, where if an
exception was taken from a context with regular IRQs *unmasked* and
ICC_IAR1_EL1 reported a special INTID, gic_handle_irq() would erroneously
fail to unmask NMIs (and consequently regular IRQs could not be
unmasked during softirq processing). Before and after that commit, if an
NMI was taken from a context with regular IRQs unmasked gic_handle_irq()
would not unmask NMIs prior to returning, leading to the same problem
with softirq handling.
This patch fixes this by restructuring gic_handle_irq(), splitting it
into separate irqson/irqsoff helper functions which consistently perform
the DAIF + ICC_PMR_EL1 manipulation based upon the interrupted context,
regardless of the event indicated by ICC_IAR1_EL1.
The special INTID handling is moved into the low-level IRQ/NMI handler
invocation helper functions, so that early returns don't prevent the
required manipulation of DAIF + ICC_PMR_EL1.
Fixes: f32c926651dcd168 ("irqchip/gic-v3: Handle pseudo-NMIs")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link:
(cherry picked from commit 614ab80c96474682157cabb14f8c8602b3422e90)
Conflicts:
drivers/irqchip/irq-gic-v3.c
For two reasons. One, we have commit 0ab2a632cc5e ("REVISIT: ANDROID:
power: wakeup_reason: add an API to log wakeup reasons") which causes
a small conflict (easy to resolve). Two, upstream has commit
0953fb263714 ("irq: remove handle_domain_{irq,nmi}()"). We'd have to
pick a 17-patch series to take that patch and the difference is small.
BUG=b:197061987
TEST=pseudo NMIs still work
Change-Id: If024d8e05f48ca9a3864bee2aee3cff27c160c8c
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Commit-Queue: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
M drivers/irqchip/irq-gic-v3.c
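The restructuring described in the commit above (separate irqson/irqsoff paths, with special INTID handling pushed down into the invocation helpers) can be pictured with a small state model. This is only a sketch under simplifying assumptions: DAIF.IF and ICC_PMR_EL1 are each reduced to a single boolean, and every `toy_*` name is hypothetical rather than taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the (modelled) read of ICC_IAR1_EL1 / ICC_RPR_EL1 reports. */
enum toy_event { TOY_EV_IRQ, TOY_EV_NMI, TOY_EV_SPECIAL };

/* Simplified CPU state; both masks are booleans in this model. */
struct toy_cpu {
	bool daif_if_masked;	/* models DAIF.I/F being set */
	bool pmr_irqs_masked;	/* models ICC_PMR_EL1 masking regular IRQs */
	int irqs_handled;
	int nmis_handled;
};

/* Low-level helpers: the special INTID early return lives here. */
static void toy_invoke_irq(struct toy_cpu *cpu, enum toy_event ev)
{
	if (ev == TOY_EV_SPECIAL)
		return;
	cpu->irqs_handled++;
}

static void toy_invoke_nmi(struct toy_cpu *cpu, enum toy_event ev)
{
	if (ev == TOY_EV_SPECIAL)
		return;
	cpu->nmis_handled++;
}

/* Interrupted context had regular IRQs unmasked. */
static void toy_handle_from_irqson(struct toy_cpu *cpu, enum toy_event ev)
{
	/* The acknowledge happens with NMIs still masked, so an NMI runs here. */
	if (ev == TOY_EV_NMI)
		toy_invoke_nmi(cpu, ev);

	/* Always done, whatever the event: mask IRQs via PMR, unmask NMIs. */
	cpu->pmr_irqs_masked = true;
	cpu->daif_if_masked = false;

	if (ev != TOY_EV_NMI)
		toy_invoke_irq(cpu, ev);
}

/* Interrupted context had regular IRQs masked: only an NMI can arrive. */
static void toy_handle_from_irqsoff(struct toy_cpu *cpu, enum toy_event ev)
{
	/* DAIF and PMR are left as they were found. */
	toy_invoke_nmi(cpu, ev);
}

static void toy_gic_handle_irq(struct toy_cpu *cpu, bool interrupted_irqs_on,
			       enum toy_event ev)
{
	assert(cpu != NULL);
	cpu->daif_if_masked = true;	/* exception entry masks everything */
	if (interrupted_irqs_on)
		toy_handle_from_irqson(cpu, ev);
	else
		toy_handle_from_irqsoff(cpu, ev);
}
```

In this model the state left behind depends only on the interrupted context: the irqson path ends with NMIs unmasked and regular IRQs masked for all three events, which is the consistency the commit message says was missing for special INTIDs and NMIs.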
ap...@google.com <ap...@google.com> #57
Branch: chromeos-5.10
commit 43cd56150722b13f2004ee9352a7e3b995fa9693
Author: Douglas Anderson <dianders@chromium.org>
Date: Tue May 30 13:59:53 2023
BACKPORT: irqchip/gic-v3: Fix priority mask handling
When a kernel is built with CONFIG_ARM64_PSEUDO_NMI=y and pseudo-NMIs
are enabled at runtime, GICv3's gic_handle_irq() can leave DAIF and
ICC_PMR_EL1 in an unexpected state in some cases, breaking subsequent
usage of local_irq_enable() and resulting in softirqs being run with
IRQs erroneously masked (possibly resulting in deadlocks).
This can happen when an IRQ exception is taken from a context where
regular IRQs were unmasked, and either:
(1) ICC_IAR1_EL1 indicates a special INTID (e.g. as a result of an IRQ
being withdrawn since the IRQ exception was taken).
(2) ICC_IAR1_EL1 and ICC_RPR_EL1 indicate an NMI was acknowledged.
When an NMI is taken from a context where regular IRQs were masked,
there is no problem.
When CONFIG_ARM64_DEBUG_PRIORITY_MASKING=y, this can be detected with
perf, e.g.
| # ./perf record -a -g -e cycles:k ls -alR / > /dev/null 2>&1
| ------------[ cut here ]------------
| WARNING: CPU: 0 PID: 14 at arch/arm64/include/asm/irqflags.h:32 arch_local_irq_enable+0x4c/0x6c
| Modules linked in:
| CPU: 0 PID: 14 Comm: ksoftirqd/0 Not tainted 5.18.0-rc5-00004-g876c38e3d20b #12
| Hardware name: linux,dummy-virt (DT)
| pstate: 204000c5 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : arch_local_irq_enable+0x4c/0x6c
| lr : __do_softirq+0x110/0x5d8
| sp : ffff8000080bbbc0
| pmr_save: 000000f0
| x29: ffff8000080bbbc0 x28: ffff316ac3a6ca40 x27: 0000000000000000
| x26: 0000000000000000 x25: ffffa04611c06008 x24: ffffa04611c06008
| x23: 0000000040400005 x22: 0000000000000200 x21: ffff8000080bbe20
| x20: ffffa0460fe10320 x19: 0000000000000009 x18: 0000000000000000
| x17: ffff91252dfa9000 x16: ffff800008004000 x15: 0000000000004000
| x14: 0000000000000028 x13: ffffa0460fe17578 x12: ffffa0460fed4294
| x11: ffffa0460fedc168 x10: ffffffffffffff80 x9 : ffffa0460fe10a70
| x8 : ffffa0460fedc168 x7 : 000000000000b762 x6 : 00000000057c3bdf
| x5 : ffff8000080bbb18 x4 : 0000000000000000 x3 : 0000000000000001
| x2 : ffff91252dfa9000 x1 : 0000000000000060 x0 : 00000000000000f0
| Call trace:
| arch_local_irq_enable+0x4c/0x6c
| __irq_exit_rcu+0x180/0x1ac
| irq_exit_rcu+0x1c/0x44
| el1_interrupt+0x4c/0xe4
| el1h_64_irq_handler+0x18/0x24
| el1h_64_irq+0x74/0x78
| smpboot_thread_fn+0x68/0x2c0
| kthread+0x124/0x130
| ret_from_fork+0x10/0x20
| irq event stamp: 193241
| hardirqs last enabled at (193240): [<ffffa0460fe10a9c>] __do_softirq+0x10c/0x5d8
| hardirqs last disabled at (193241): [<ffffa0461102ffe4>] el1_dbg+0x24/0x90
| softirqs last enabled at (193234): [<ffffa0460fe10e00>] __do_softirq+0x470/0x5d8
| softirqs last disabled at (193239): [<ffffa0460fea9944>] __irq_exit_rcu+0x180/0x1ac
| ---[ end trace 0000000000000000 ]---
The necessary manipulation of DAIF and ICC_PMR_EL1 depends on the
interrupted context, but the structure of gic_handle_irq() makes this
also depend on whether the GIC reports an IRQ, NMI, or special INTID:
* When the interrupted context had regular IRQs masked (and hence the
interrupt must be an NMI), the entry code performs the NMI
entry/exit and gic_handle_irq() should return with DAIF and
ICC_PMR_EL1 unchanged.
This is handled correctly today.
* When the interrupted context had regular IRQs unmasked, the entry code
performs IRQ entry/exit, but expects gic_handle_irq() to always update
ICC_PMR_EL1 and DAIF.IF to unmask NMIs (but not regular IRQs) prior to
returning (which it must do prior to invoking any regular IRQ
handler).
This unbalanced calling convention is necessary because we don't know
whether an NMI has been taken until acknowledged by a read from
ICC_IAR1_EL1, and so we need to perform the read with NMI masked in
case an NMI has been taken (and needs to be handled with NMIs masked).
Unfortunately, this is not handled consistently:
- When ICC_IAR1_EL1 reports a special INTID, gic_handle_irq() returns
immediately without manipulating ICC_PMR_EL1 and DAIF.
- When ICC_RPR_EL1 indicates an NMI, gic_handle_irq() calls
gic_handle_nmi() to invoke the NMI handler, then returns without
manipulating ICC_PMR_EL1 and DAIF.
- For regular IRQs, gic_handle_irq() manipulates ICC_PMR_EL1 and DAIF
prior to invoking the IRQ handler.
There were related problems with special INTID handling in the past,
where if an exception was taken from a context with regular IRQs masked
and ICC_IAR1_EL1 reported a special INTID, gic_handle_irq() would
erroneously unmask NMIs in NMI context, permitting an unexpected nested
NMI. That case specifically was fixed by commit:
a97709f563a078e2 ("irqchip/gic-v3: Do not enable irqs when handling spurious interrups")
... but unfortunately that commit added an inverse problem, where if an
exception was taken from a context with regular IRQs *unmasked* and
ICC_IAR1_EL1 reported a special INTID, gic_handle_irq() would erroneously
fail to unmask NMIs (and consequently regular IRQs could not be
unmasked during softirq processing). Before and after that commit, if an
NMI was taken from a context with regular IRQs unmasked gic_handle_irq()
would not unmask NMIs prior to returning, leading to the same problem
with softirq handling.
This patch fixes this by restructuring gic_handle_irq(), splitting it
into separate irqson/irqsoff helper functions which consistently perform
the DAIF + ICC_PMR_EL1 manipulation based upon the interrupted context,
regardless of the event indicated by ICC_IAR1_EL1.
The special INTID handling is moved into the low-level IRQ/NMI handler
invocation helper functions, so that early returns don't prevent the
required manipulation of DAIF + ICC_PMR_EL1.
Fixes: f32c926651dcd168 ("irqchip/gic-v3: Handle pseudo-NMIs")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link:
(cherry picked from commit 614ab80c96474682157cabb14f8c8602b3422e90)
Conflicts:
drivers/irqchip/irq-gic-v3.c
For two reasons. One, we have commit 0ab2a632cc5e ("REVISIT: ANDROID:
power: wakeup_reason: add an API to log wakeup reasons") which causes
a small conflict (easy to resolve). Two, upstream has commit
0953fb263714 ("irq: remove handle_domain_{irq,nmi}()"). We'd have to
pick a 17-patch series to take that patch and the difference is small.
BUG=b:197061987
TEST=pseudo NMIs still work
Change-Id: If024d8e05f48ca9a3864bee2aee3cff27c160c8c
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
M drivers/irqchip/irq-gic-v3.c
ap...@google.com <ap...@google.com> #58
Branch: chromeos-5.10
commit 7d3e8ae561a35a739bf149dcc9b49e4c6062a506
Author: Mark Rutland <mark.rutland@arm.com>
Date: Fri May 13 14:30:37 2022
UPSTREAM: irqchip/gic-v3: Refactor ISB + EOIR at ack time
There are cases where a context synchronization event is necessary
between an IRQ being raised and being handled, and there are races such
that we cannot rely upon the exception entry being subsequent to the
interrupt being raised. To fix this, we place an ISB between a read of
IAR and the subsequent invocation of an IRQ handler.
When EOI mode 1 is in use, we need to EOI an interrupt prior to invoking
its handler, and we have a write to EOIR for this. As this write to EOIR
requires an ISB, and this is provided by the gic_write_eoir() helper, we
omit the usual ISB in this case, with the logic being:
| if (static_branch_likely(&supports_deactivate_key))
| gic_write_eoir(irqnr);
| else
| isb();
This is somewhat opaque, and it would be a little clearer if there were
an unconditional ISB, with only the write to EOIR being conditional,
e.g.
| if (static_branch_likely(&supports_deactivate_key))
| write_gicreg(irqnr, ICC_EOIR1_EL1);
|
| isb();
This patch rewrites the code that way, with this logic factored into a
new helper function with comments explaining what the ISB is for, as
were originally laid out in commit:
39a06b67c2c1256b ("irqchip/gic: Ensure we have an ISB between ack and ->handle_irq")
Note that since then, we removed the IAR polling in commit:
342677d70ab92142 ("irqchip/gic-v3: Remove acknowledge loop")
... which removed one of the two race conditions.
For consistency, other portions of the driver are made to manipulate
EOIR using write_gicreg() and explicit ISBs, and the gic_write_eoir()
helper function is removed.
There should be no functional change as a result of this patch.
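
As a rough illustration of the before/after control flow described above, here is a userspace C sketch. Everything prefixed `model_` is hypothetical: the kernel's static key, `write_gicreg()` and `isb()` are replaced by stubs that only count calls, and the helper name `model_complete_ack()` is an assumption for this sketch, not necessarily the name used in the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for kernel state and primitives; not the real API. */
static bool model_eoi_mode1;            /* plays the role of supports_deactivate_key */
static unsigned int model_eoir_writes;  /* counts modelled writes to ICC_EOIR1_EL1 */
static unsigned int model_isb_count;    /* counts modelled ISB instructions */

static void model_write_eoir(uint32_t irqnr)
{
    (void)irqnr;
    model_eoir_writes++;
}

static void model_isb(void)
{
    model_isb_count++;
}

/* Old shape: the EOIR helper carried its own ISB; the else branch added one. */
static void model_ack_old(uint32_t irqnr)
{
    if (model_eoi_mode1) {
        model_write_eoir(irqnr);
        model_isb();    /* provided by gic_write_eoir() in the old code */
    } else {
        model_isb();
    }
}

/* New shape: only the EOIR write is conditional; the ISB always happens. */
static void model_complete_ack(uint32_t irqnr)
{
    if (model_eoi_mode1)
        model_write_eoir(irqnr);

    model_isb();
}

/* Both shapes issue exactly one ISB per ack, in either EOI mode. */
static void model_check_equivalence(void)
{
    for (int mode = 0; mode < 2; mode++) {
        unsigned int old_isb, old_eoir;

        model_eoi_mode1 = (mode == 1);

        model_isb_count = model_eoir_writes = 0;
        model_ack_old(42);
        old_isb = model_isb_count;
        old_eoir = model_eoir_writes;

        model_isb_count = model_eoir_writes = 0;
        model_complete_ack(42);

        assert(old_isb == 1 && model_isb_count == old_isb);
        assert(model_eoir_writes == old_eoir);
    }
}
```

In this model the two shapes are indistinguishable by call counts, which matches the "no functional change" claim; the benefit of the new shape is only that the ISB is visible at the call site.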
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link:
(cherry picked from commit 6efb50923771f392122f5ce69dfc43b08f16e449)
BUG=b:197061987
TEST=pseudo NMIs still work
Change-Id: Iff515a713d32d559a7c7e687c372f08d01576700
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
M arch/arm/include/asm/arch_gicv3.h
M arch/arm64/include/asm/arch_gicv3.h
M drivers/irqchip/irq-gic-v3.c
ap...@google.com <ap...@google.com> #59
Branch: chromeos-5.10
commit 161bd18ecce8f7414743afd985624ca80ba4c8cf
Author: Mark Rutland <mark.rutland@arm.com>
Date: Fri May 13 14:30:36 2022
UPSTREAM: irqchip/gic-v3: Ensure pseudo-NMIs have an ISB between ack and handling
There are cases where a context synchronization event is necessary
between an IRQ being raised and being handled, and there are races such
that we cannot rely upon the exception entry being subsequent to the
interrupt being raised.
We identified and fixed this for regular IRQs in commit:
39a06b67c2c1256b ("irqchip/gic: Ensure we have an ISB between ack and ->handle_irq")
Unfortunately, we forgot to do the same for pseudo-NMIs when support for
those was added in commit:
f32c926651dcd168 ("irqchip/gic-v3: Handle pseudo-NMIs")
Which means that when pseudo-NMIs are used for PMU support, we'll hit
the same problem.
Apply the same fix as for regular IRQs. Note that when EOI mode 1 is in
use, the call to gic_write_eoir() will provide an ISB.
Fixes: f32c926651dcd168 ("irqchip/gic-v3: Handle pseudo-NMIs")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link:
(cherry picked from commit adf14453d2c037ab529040c1186ea32e277e783a)
BUG=b:197061987
TEST=pseudo NMIs still work
Change-Id: I87dab8a11de106a7beed3b8bb6fd467e4f3dd8b3
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Sean Paul <sean@poorly.run>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
M drivers/irqchip/irq-gic-v3.c
di...@google.com <di...@google.com> #60
I've sent out v9 of this patch series:
We also landed all the upstream fixes for pseudo-NMI in chromeos-5.10 and chromeos-5.15. chromeos-6.1 already has them.
Updated plan:
- Keep track of Mediatek in b/281831288.
- Figure out if there is anything else actually broken with pseudo NMI. See my ping to Mark Rutland.
- Figure out how to get the pseudo-NMI backtrace series landed upstream.
- Re-confirm Chen-Yu's tests earlier that show that turning on pseudo-NMI isn't causing any big performance regressions on all of our hardware.
- Turn on in arm64 config for ChromeOS 5.10, 5.15, and 6.1 and add the magic kernel command line argument to our build scripts.
Then, I think we're done.
di...@google.com <di...@google.com> #61
OK, I ran hackbench on trogdor based somewhat on Chen-Yu's methods in
I've attached my hacky script that I used. On my machine, the 3 branches (`230719-nmi-test-trogdor-config-disabled`, `230719-nmi-test-trogdor-config-enabled`, and `230719-nmi-test-trogdor-forced-on`) represent how many of the CLs in the chain are applied. The "disabled" branch is just before turning on the config. The "enabled" branch is after turning on the config but without the command line argument. The "forced on" branch is where I just force the command line argument on in the kernel.
In order to rule out any changes like thermal pressure changing over the course of the test, I did my usual interleaving of the tests (test all 3 configs, then test all 3 configs again, etc). Raw results are attached.
Here are the results sorted by the config:
$ grep -A2 "230719-nmi-test-trogdor-config-disabled" /b/tip/tmp/230719-pnmi | grep mean
min=24.4, max=25.0, mean=24.6
min=23.8, max=26.5, mean=24.5
min=23.9, max=24.7, mean=24.2
min=24.3, max=24.8, mean=24.5
min=23.9, max=24.4, mean=24.2
$ grep -A2 "230719-nmi-test-trogdor-config-enabled" /b/tip/tmp/230719-pnmi | grep mean
min=23.8, max=24.6, mean=24.1
min=23.9, max=28.3, mean=24.6
min=24.0, max=25.5, mean=24.3
min=23.7, max=24.9, mean=24.2
min=23.7, max=24.9, mean=24.1
$ grep -A2 "230719-nmi-test-trogdor-forced-on" /b/tip/tmp/230719-pnmi | grep mean
min=25.0, max=25.8, mean=25.4
min=25.2, max=26.0, mean=25.5
min=24.8, max=25.6, mean=25.3
min=24.7, max=25.9, mean=25.2
min=25.1, max=25.6, mean=25.4
From my tests, you can see that turning on the config really didn't do too much. However, actually enabling pseudo-NMI did cause a slight regression in hackbench (lower numbers are better), giving ~4% worse results. I would believe that this is probably OK for a microbenchmark. I'll try to double-confirm against ToT Linux and also see if any real-world benchmarks are affected.
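
For anyone wanting to re-derive the ~4% figure, here is a small C sketch of the arithmetic. It averages the per-loop means from the grep output above and compares configs. The function and array names are made up for this sketch, and averaging the per-loop means (rather than the raw samples in the attachment) is a simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples. */
static double mean_of(const double *v, size_t n)
{
    double sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Percent change of `after` relative to `before`; positive means slower. */
static double percent_change(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Per-loop hackbench means copied from the grep output above. */
static const double cfg_disabled[] = { 24.6, 24.5, 24.2, 24.5, 24.2 };
static const double cfg_enabled[]  = { 24.1, 24.6, 24.3, 24.2, 24.1 };
static const double forced_on[]    = { 25.4, 25.5, 25.3, 25.2, 25.4 };

/* Config disabled vs. pseudo-NMI actually forced on. */
static double regression_forced_on(void)
{
    return percent_change(mean_of(cfg_disabled, 5), mean_of(forced_on, 5));
}

/* Config disabled vs. config enabled without the command line argument. */
static double regression_config_only(void)
{
    return percent_change(mean_of(cfg_disabled, 5), mean_of(cfg_enabled, 5));
}
```

With these inputs the forced-on case comes out a bit under 4% slower than the disabled baseline, while config-only differs from the baseline by well under 1%, which looks to be within the loop-to-loop spread.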
di...@google.com <di...@google.com> #62
I did 3 loops against vanilla kernel 6.4 (with my patches applied) using the same methodology.
$ grep -A2 config-disabled /b/tip/tmp/230720-pnmi-mainline | grep mean
min=25.0, max=26.9, mean=25.5
min=25.1, max=27.4, mean=25.6
min=24.9, max=26.9, mean=25.4
$ grep -A2 config-enabled /b/tip/tmp/230720-pnmi-mainline | grep mean
min=24.8, max=29.3, mean=25.5
min=24.8, max=25.5, mean=25.1
min=25.0, max=26.8, mean=25.4
$ grep -A2 forced-on /b/tip/tmp/230720-pnmi-mainline | grep mean
min=25.9, max=27.6, mean=26.4
min=25.8, max=26.5, mean=26.1
min=26.0, max=26.5, mean=26.2
Baseline results were slower by about 4% and then, just like against chromeos-5.15, we lost another 4% enabling pseudo-NMI. The baseline results might be slower due to other config options. Since this was a vanilla kernel I built it with the fallback config instead of the normal ChromeOS config.
I also did 3 loops against a 5.15 kernel but turned off "auto FDO" by adding "-kernel_afdo -kern_arm_afdo" to my USE flags. In that case baseline got even worse and we again got ~4% worse when we fully enabled pseudo-NMI.
$ grep -A2 config-disabled /b/tip/tmp/230720-pnmi-noafdo | grep mean
min=26.6, max=27.4, mean=27.1
min=26.6, max=27.8, mean=27.2
min=26.8, max=27.7, mean=27.2
$ grep -A2 config-enabled /b/tip/tmp/230720-pnmi-noafdo | grep mean
min=27.1, max=28.2, mean=27.6
min=27.0, max=28.1, mean=27.4
min=26.8, max=29.5, mean=27.5
$ grep -A2 forced /b/tip/tmp/230720-pnmi-noafdo | grep mean
min=27.9, max=29.2, mean=28.5
min=28.3, max=29.0, mean=28.6
min=28.1, max=29.4, mean=28.5
di...@google.com <di...@google.com> #63
I managed to get speedometer running with a similar test and it made no measurable difference. ...so I think the 4% microbenchmark hackbench regression here is not something to be overly concerned about. While speedometer isn't the end-all be-all of benchmarks, I think we can be confident that the real-world effect of enabling this won't be large.
Hacky script and results attached. It looks like there was one failure to run the test at the very end, but otherwise data is pretty complete. To make the data easy to visualize, I graphed it in a spreadsheet and also took the mean of the data. The mean of the data after 25 runs was nearly identical. The graph also makes it easy to see that there's no real difference.
ap...@google.com <ap...@google.com> #64
Branch: main
commit 273ff1f613fb66d7ee9f9b40af0dbe42dc77c67d
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Jul 26 15:11:40 2023
mediatek_defconfig: Turn on CONFIG_ARM64_PSEUDO_NMI
We haven't turned this on in our main config yet, but put it in the
fallback config to facilitate debugging there.
NOTE: this is a little iffier on Mediatek than on Qualcomm given the
firmware bugs we found on Mediatek (
be OK. Specifically:
- As long as you have a new enough kernel you should have the
devicetree preventing this from breaking unless you have the
firmware fix.
- In order to use this, you need the `irqchip.gicv3_pseudo_nmi=1`
kernel command line, so hopefully people not "in the know" won't
accidentally shoot themselves in the foot even if they're testing
older kernels.
BUG=b:197061987
TEST=Use pseudo NMIs w/ lockup detector
Change-Id: Id5e89179214f1c359b4341120c26bca168534f4d
Reviewed-on:
Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
M eclass/cros-kernel/mediatek_defconfig
ap...@google.com <ap...@google.com> #65
Branch: main
commit 7db1caf4f6140b0512c0054219293ba57bdd7096
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Apr 19 13:08:05 2023
qualcomm_defconfig: Turn on CONFIG_ARM64_PSEUDO_NMI
We haven't turned this on in our main config yet, but put it in the
fallback config to facilitate debugging there.
NOTE: without the `irqchip.gicv3_pseudo_nmi=1` kernel command line
parameter, this doesn't actually do a whole lot other than enable the
code paths.
BUG=b:197061987
TEST=Use pseudo NMIs w/ lockup detector
Change-Id: I34a3e4b52d050052229b9627b5732aec65ff703b
Reviewed-on:
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
M eclass/cros-kernel/qualcomm_defconfig
di...@google.com <di...@google.com> #66
FWIW: these landed upstream (hooray!). I refreshed the picks against 5.15 and 6.1, but they can't quite land yet. The bots found that when you enable `CONFIG_ARM64_ACPI_PARKING_PROTOCOL` (which we don't normally enable) they don't compile in our tree. That's because the patch uses `arch_smp_send_reschedule()`, which doesn't exist in our tree. It came from commit 4c8c3c7f70a6 ("treewide: Trace IPIs sent via smp_send_reschedule()"). That made me wonder why we're using `arch_smp_send_reschedule()` directly instead of `smp_send_reschedule()`, so I raised that upstream. If upstream switches to `smp_send_reschedule()` then that would be cleanest, IMO. Then we can just pick the set and not worry about adding the tracing to our tree.
For now, juggling:
bugjuggler: wait 1 week
Also: an update on plans. As of now, I probably won't try to backport these to 5.10. Specifically, the current first patch of the series depends on 6abbd6988971 ("irqchip/gic, gic-v3: Make SGIs use handle_percpu_devid_irq()"), which doesn't exist in our 5.10. We could backport it, but unless we're actually going to roll firmware out for the various Mediatek boards then it's probably not worth it. Mediatek boards will get this feature when they move to 5.15+.
bu...@google.com <bu...@google.com> #67
we...@google.com <we...@google.com> #68
Running next on MediaTek devices without firmware fixed, I get the following warnings, one per CPU core.
prepare_percpu_nmi called for a non-NMI interrupt: irq 4
WARNING: CPU: 7 PID: 0 at kernel/irq/manage.c:2742 prepare_percpu_nmi+0x19c/0x1c0
Modules linked in:
CPU: 7 PID: 0 Comm: swapper/7 Tainted: G W 6.6.0-rc3-next-20230928-08558-g358342cb3833 #31 7514c590cecd3e60aced4f5aa68fe1342e34a074
Hardware name: Google Hayato rev1 (DT)
pstate: 604001c9 (nZCv dAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : prepare_percpu_nmi+0x19c/0x1c0
lr : prepare_percpu_nmi+0x19c/0x1c0
sp : ffffffc080477d90
pmr_save: 000000f0
x29: ffffffc080477d90 x28: 0000000000000000 x27: 0000000000000000
x26: 0000000000000000 x25: ffffffe09e2ae000 x24: ffffffe09e2ae428
x23: ffffffe09f411f80 x22: 1ffffff81008efba x21: 0000000000000004
x20: 0000000000000001 x19: ffffff80c0074000 x18: 0000000000000000
x17: 000000040044ffff x16: 005000f2b5593519 x15: 0000000000000000
x14: 1ffffff81008ef0e x13: 0000000041b58ab3 x12: ffffffbc13e4b389
x11: 1ffffffc13e4b388 x10: ffffffbc13e4b388 x9 : ffffffe09c5c0cc4
x8 : 00000043ec1b4c78 x7 : ffffffe09f259c40 x6 : 0000000000000001
x5 : ffffffe09f259c40 x4 : 0000000000000000 x3 : dfffffc000000000
x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff80c0af0000
Call trace:
prepare_percpu_nmi+0x19c/0x1c0
ipi_setup.isra.0+0x44/0xc8
secondary_start_kernel+0x148/0x2b0
__secondary_switched+0xb8/0xc0
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffe09c4d9900>] copy_process+0xcd0/0x2880
softirqs last enabled at (0): [<ffffffe09c4d9900>] copy_process+0xcd0/0x2880
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
Maybe we could do something about this?
we...@google.com <we...@google.com> #69
BTW, I see other teams are backporting pseudo-NMI for their uses, and I see some latency / performance regression measurements in
di...@google.com <di...@google.com> #70
Running next on MediaTek devices without firmware fixed, I get the following warnings, one per CPU core.
Ah, dang. It didn't do this before I rejiggered all the code for Mark's feedback, but it clearly does now. I'll send up a patch ASAP.
BTW, I see other teams are backporting pseudo-NMI for their uses, and I see some latency / performance regression measurements in
Yeah, I was aware of some of these. We should certainly keep an eye on this, but I think we know that some codepaths will be slower with pseudo-NMI. IMO we should mitigate these as much as possible and then enable it anyway. The extra debuggability is worth it. I'm happy to have a discussion about this if people disagree, though. I'm also planning to try to talk with other kernel teams at Google about this in early November.
di...@google.com <di...@google.com> #71
we...@google.com <we...@google.com> #72
I'm also planning to try to talk with other kernel teams at Google about this in early November.
It would make a great topic for Kernel Exchange.
bu...@google.com <bu...@google.com>
di...@google.com <di...@google.com> #73
It would make a great topic for Kernel Exchange.
Yup, that's the plan and what I was alluding to. I've already submitted a lightning talk to hopefully get people talking about it.
Still working on stuff upstream. Hopefully we'll get things all resolved sooner, but schedule a nag for 2 weeks just in case:
bugjuggler: wait 2 weeks
bu...@google.com <bu...@google.com> #74
di...@google.com <di...@google.com> #75
Fixes have landed upstream and I've refreshed the picks to v6.1 and v5.15.
ap...@google.com <ap...@google.com> #76
Branch: chromeos-6.1
commit 34ffe7f15cdfc34b121855e67e50b2f7d49d16db
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed May 10 07:21:36 2023
CHROMIUM: config: Turn on CONFIG_ARM64_PSEUDO_NMI
We want to use this to get backtraces when CPUs are locked up.
This is:
echo 'CONFIG_ARM64_PSEUDO_NMI=y' >> \
chromeos/config/chromeos/arm64/common.config
./chromeos/scripts/kernelconfig olddefconfig
When doing this we also need to explicitly turn on
"CONFIG_HARDLOCKUP_DETECTOR_PREFER_BUDDY". We got the buddy detector
by default before, but with "CONFIG_ARM64_PSEUDO_NMI" the perf
detector becomes default and we don't want that. Aside from resource
usage of the perf detector, on arm64 pseudo-NMI isn't guaranteed on
every SoC and the buddy detector falls back to more functionality than
the perf one.
NOTE: this needs to work together with adding
`irqchip.gicv3_pseudo_nmi=1` to the kernel command line.
BUG=b:197061987
TEST=w/ supporting patches can now do backtrace hardlocked CPUs
Change-Id: Ie8c03caba651cfb52eb0af17bfec7610f91fcf44
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Guenter Roeck <groeck@chromium.org>
M chromeos/config/chromeos/arm64/common.config
ap...@google.com <ap...@google.com> #77
Branch: chromeos-6.1
commit 3c05acf474383361c8b84e448ca472f1a7ce8ec3
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Sep 06 09:03:01 2023
FROMGIT: arm64: kgdb: Implement kgdb_roundup_cpus() to enable pseudo-NMI roundup
Up until now we've been using the generic (weak) implementation for
kgdb_roundup_cpus() when using kgdb on arm64. Let's move to a custom
one. The advantage here is that, when pseudo-NMI is enabled on a
device, we'll be able to round up CPUs using pseudo-NMI. This allows
us to debug CPUs that are stuck with interrupts disabled. If
pseudo-NMIs are not enabled then we'll fall back to just using an IPI,
which is still slightly better than the generic implementation since
it avoids the potential situation described in the generic
kgdb_call_nmi_hook().
Co-developed-by: Sumit Garg <sumit.garg@linaro.org>
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link:
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit 2f5cd0c7ffde0ec7779f27e5c4ed30e131b66393
git://
BUG=b:197061987
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: I2ef26d1b3bfbed2d10a281942b0da7d9854de05e
Reviewed-on:
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
Tested-by: Douglas Anderson <dianders@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
M arch/arm64/kernel/smp.c
ap...@google.com <ap...@google.com> #78
Branch: chromeos-6.1
commit cb93cf2a2a9d75e61a6c84e1e0f0002e41968db8
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Sep 06 09:03:00 2023
FROMGIT: arm64: smp: IPI_CPU_STOP and IPI_CPU_CRASH_STOP should try for NMI
There's no reason why IPI_CPU_STOP and IPI_CPU_CRASH_STOP can't be
handled as NMI. They are very simple and everything in them is
NMI-safe. Mark them as things to use NMI for if NMI is available.
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Misono Tomohiro <misono.tomohiro@fujitsu.com>
Reviewed-by: Sumit Garg <sumit.garg@linaro.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link:
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit d7402513c935ad87413b01aa51a7ada0ad2f0163
git://
BUG=b:197061987
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: Ifadbfd45b22c52edcb499034dd4783d096343260
Reviewed-on:
Reviewed-by: Sean Paul <sean@poorly.run>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
M arch/arm64/kernel/smp.c
ap...@google.com <ap...@google.com> #79
Branch: chromeos-6.1
commit 64fcda20b3df0467bcd98590a08809f185ef9936
Author: Mark Rutland <mark.rutland@arm.com>
Date: Mon Oct 02 18:00:36 2023
FROMGIT: arm64: smp: avoid NMI IPIs with broken MediaTek FW
Some MediaTek devices have broken firmware which corrupts some GICR
registers behind the back of the OS, and pseudo-NMIs cannot be used on
these devices. For more details see commit:
44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues")
We did not take this problem into account in commit:
331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
Since that commit arm64's SMP code will try to set up some IPIs as
pseudo-NMIs, even on systems with broken FW. The GICv3 code will
(rightly) reject attempts to request interrupts as pseudo-NMIs,
resulting in boot-time failures.
Avoid the problem by taking the broken FW into account when deciding to
request IPIs as pseudo-NMIs. The GICv3 driver maintains a static_key
named "supports_pseudo_nmis" which is false on systems with broken FW,
and we can consult this within ipi_should_be_nmi().
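
A minimal userspace sketch of the decision described above, assuming the check is a simple early return. The `model_` names are hypothetical, a plain `bool` stands in for the "supports_pseudo_nmis" static key, and the enum is illustrative only (the backtrace and kgdb entries are named after the features in this thread, not copied from the kernel).

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative IPI list; the names and order are assumptions for this sketch. */
enum model_ipi {
    MODEL_IPI_RESCHEDULE,
    MODEL_IPI_CPU_STOP,
    MODEL_IPI_CPU_CRASH_STOP,
    MODEL_IPI_BACKTRACE,
    MODEL_IPI_KGDB_ROUNDUP,
};

/* Plain bool standing in for the GICv3 driver's "supports_pseudo_nmis" key. */
static bool model_supports_pseudo_nmis;

static bool model_ipi_should_be_nmi(enum model_ipi ipi)
{
    /* Broken firmware or pseudo-NMI disabled: never request an NMI. */
    if (!model_supports_pseudo_nmis)
        return false;

    switch (ipi) {
    case MODEL_IPI_CPU_STOP:
    case MODEL_IPI_CPU_CRASH_STOP:
    case MODEL_IPI_BACKTRACE:
    case MODEL_IPI_KGDB_ROUNDUP:
        return true;
    default:
        return false;
    }
}

/* On "broken FW" nothing is an NMI; otherwise only the debug-style IPIs are. */
static void model_example(void)
{
    model_supports_pseudo_nmis = false;
    assert(!model_ipi_should_be_nmi(MODEL_IPI_BACKTRACE));

    model_supports_pseudo_nmis = true;
    assert(model_ipi_should_be_nmi(MODEL_IPI_BACKTRACE));
    assert(!model_ipi_should_be_nmi(MODEL_IPI_RESCHEDULE));
}
```

The point of gating at this level is that the SMP code never asks the irqchip for an NMI it would refuse, so the affected IPIs are requested as ordinary interrupts instead.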
Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
Reported-by: Chen-Yu Tsai <wenst@chromium.org>
Closes:
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Tested-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit a07a594152173a3dd3bdd12fc7d73dbba54cdbca
git://
BUG=b:197061987
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: I695adbdda561ab9476f4b6d3732a920c7b345579
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Sean Paul <sean@poorly.run>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
M arch/arm64/kernel/smp.c
M drivers/irqchip/irq-gic-v3.c
ap...@google.com <ap...@google.com> #80
Branch: chromeos-6.1
commit ec862cf6332903153e48cb29d00a9596ac7c1fbd
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Sep 06 09:02:59 2023
FROMGIT: arm64: smp: Add arch support for backtrace using pseudo-NMI
Enable arch_trigger_cpumask_backtrace() support on arm64. This enables
things much like they are enabled on arm32 (including some of the
funky logic around NR_IPI, nr_ipi, and MAX_IPI) but with the
difference that, unlike arm32, we'll try to enable the backtrace to
use pseudo-NMI.
NOTE: this patch is a squash of the little bit of code adding the
ability to mark an IPI to try to use pseudo-NMI plus the little bit of
code to hook things up for kgdb. This approach was decided upon in the
discussion of v9 [1].
This patch depends on commit 8d539b84f1e3 ("nmi_backtrace: allow
excluding an arbitrary CPU") since that commit changed the prototype
of arch_trigger_cpumask_backtrace(), which this patch implements.
[1]
Co-developed-by: Sumit Garg <sumit.garg@linaro.org>
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Co-developed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Misono Tomohiro <misono.tomohiro@fujitsu.com>
Tested-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link:
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit 331a1b3a836c0f38165dcec168c0a03b93cf0c17
git://
BUG=b:197061987
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Cq-Depend: chromium:4919748
Change-Id: Ie6c132b96ebbbcddbf6954b9469ed40a6960343c
Reviewed-on:
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
M arch/arm64/include/asm/irq.h
M arch/arm64/kernel/smp.c
ap...@google.com <ap...@google.com> #81
Branch: chromeos-6.1
commit 5062c3c8a1edca095c2bd3180307cec9af6ee0c1
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Oct 02 09:45:30 2023
FROMGIT: arm64: smp: Don't directly call arch_smp_send_reschedule() for wakeup
In commit 2b2d0a7a96ab ("arm64: smp: Remove dedicated wakeup IPI") we
started using a scheduler IPI to avoid a dedicated reschedule. When we
did this, we used arch_smp_send_reschedule() directly rather than
calling smp_send_reschedule(). The only difference is that calling
arch_smp_send_reschedule() directly avoids tracing. Presumably we
_don't_ want to avoid tracing here, so switch to
smp_send_reschedule().
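
To make the difference concrete, here is a toy C model of the two entry points, assuming (as the message above says) that the only difference is the trace event. The `model_` functions and counters are hypothetical and merely count what would happen; they are not the kernel's implementation.

```c
#include <assert.h>

static unsigned int model_trace_events;  /* counts modelled trace events */
static unsigned int model_ipis_sent;     /* counts modelled reschedule IPIs */

/* Arch-level primitive: raises the IPI and nothing else. */
static void model_arch_smp_send_reschedule(int cpu)
{
    (void)cpu;
    model_ipis_sent++;
}

/* Generic entry point: records a trace event, then calls the arch primitive. */
static void model_smp_send_reschedule(int cpu)
{
    model_trace_events++;
    model_arch_smp_send_reschedule(cpu);
}

/* Both paths send the IPI; only the generic one shows up in tracing. */
static void model_example(void)
{
    model_trace_events = model_ipis_sent = 0;

    model_arch_smp_send_reschedule(1);
    assert(model_ipis_sent == 1 && model_trace_events == 0);

    model_smp_send_reschedule(1);
    assert(model_ipis_sent == 2 && model_trace_events == 1);
}
```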
Fixes: 2b2d0a7a96ab ("arm64: smp: Remove dedicated wakeup IPI")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit ef31b8ce313eaf891bf705d5db754e549351816f
git://
BUG=b:197061987
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: I020dc32d3ec71532a9f7a461e6ed97a48df767b6
Reviewed-on:
Reviewed-by: Sean Paul <sean@poorly.run>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
M arch/arm64/kernel/smp.c
ap...@google.com <ap...@google.com> #82
Branch: chromeos-6.1
commit 5d4a679e90079c952883ff57d2df0feb3734b031
Author: Mark Rutland <mark.rutland@arm.com>
Date: Wed Sep 06 09:02:58 2023
FROMGIT: arm64: smp: Remove dedicated wakeup IPI
To enable NMI backtrace and KGDB's NMI cpu roundup, we need to free up
at least one dedicated IPI.
On arm64 the IPI_WAKEUP IPI is only used for the ACPI parking protocol,
which itself is only used on some very early ARMv8 systems which
couldn't implement PSCI.
Remove the IPI_WAKEUP IPI, and rely on the IPI_RESCHEDULE IPI to wake
CPUs from the parked state. This will cause a tiny amount of redundant
work to check the thread flags, but this is minuscule in relation to the
cost of taking and handling the IPI in the first place. We can safely
handle redundant IPI_RESCHEDULE IPIs, so there should be no functional
impact as a result of this change.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Sumit Garg <sumit.garg@linaro.org>
Tested-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link:
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit 2b2d0a7a96ab36ed6d963e29b6211b184ef81596
git://
BUG=b:197061987
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Cq-Depend: chromium:4919749
Change-Id: I7209db47ef8ec151d3de61f59005bbc59fe8f113
Reviewed-on:
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
M arch/arm64/include/asm/smp.h
M arch/arm64/kernel/acpi_parking_protocol.c
M arch/arm64/kernel/smp.c
ap...@google.com <ap...@google.com> #83
Branch: chromeos-6.1
commit 66bec32a981c80b43fbf5132d084678077d8da28
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Sep 06 09:02:57 2023
FROMGIT: arm64: idle: Tag the arm64 idle functions as __cpuidle
As per the (somewhat recent) comment before the definition of
`__cpuidle`, the tag is like `noinstr` but also marks a function so it
can be identified by cpu_in_idle(). Let's add these markings to arm64
cpuidle functions.
With this change we get useful backtraces like:
NMI backtrace for cpu N skipped: idling at cpu_do_idle+0x94/0x98
instead of useless backtraces when dumping all processors using
nmi_cpu_backtrace().
NOTE: this patch won't make cpu_in_idle() work perfectly for arm64,
but it doesn't hurt and does catch some cases. Specifically an example
that wasn't caught in my testing looked like this:
gic_cpu_sys_reg_init+0x1f8/0x314
gic_cpu_pm_notifier+0x40/0x78
raw_notifier_call_chain+0x5c/0x134
cpu_pm_notify+0x38/0x64
cpu_pm_exit+0x20/0x2c
psci_enter_idle_state+0x48/0x70
cpuidle_enter_state+0xb8/0x260
cpuidle_enter+0x44/0x5c
do_idle+0x188/0x30c
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Acked-by: Sumit Garg <sumit.garg@linaro.org>
Tested-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link:
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit d0c14a7d36f035aeae1bdd6f4afc6488400ed5cf
git://
BUG=b:197061987
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: I4baba13e220bdd24d11400c67f137c35f07f82c7
Reviewed-on:
Reviewed-by: Sean Paul <sean@poorly.run>
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
M arch/arm64/kernel/idle.c
ap...@google.com <ap...@google.com> #84
Branch: chromeos-6.1
commit 4761f378c9188b73d6fc4703b43d60e271dc7e16
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Sep 06 09:02:56 2023
FROMGIT: irqchip/gic-v3: Enable support for SGIs to act as NMIs
As of commit 6abbd6988971 ("irqchip/gic, gic-v3: Make SGIs use
handle_percpu_devid_irq()") SGIs are treated the same as PPIs/EPPIs
and use handle_percpu_devid_irq() by default. Unfortunately,
handle_percpu_devid_irq() isn't NMI safe, and so to run in an NMI
context those should use handle_percpu_devid_fasteoi_nmi().
In order to accomplish this, we just have to make room for SGIs in the
array of refcounts that keeps track of which interrupts are set as
NMI. We also rename the array and create a new indexing scheme that
accounts for SGIs.
Also, enable NMI support prior to gic_smp_init() as allocation of SGIs
as IRQs/NMIs happens as part of this routine.
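
One plausible shape for such an indexing scheme, sketched in userspace C. This assumes SGIs (INTIDs 0-15) and PPIs (16-31) map straight to slots 0-31 and extended PPIs (INTIDs 1056-1119) follow from slot 32. The `model_`/`MODEL_` names are hypothetical and the layout is an assumption for illustration, not a copy of the driver.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SGI_PPI_COUNT   32u    /* INTIDs 0-15 are SGIs, 16-31 are PPIs */
#define MODEL_EPPI_BASE_INTID 1056u  /* extended PPIs occupy INTIDs 1056-1119 */
#define MODEL_EPPI_COUNT      64u
#define MODEL_NR_SLOTS        (MODEL_SGI_PPI_COUNT + MODEL_EPPI_COUNT)

/* One refcount per per-CPU interrupt that may be configured as an NMI. */
static unsigned int model_nmi_refs[MODEL_NR_SLOTS];

/* Map a hardware INTID to a refcount slot, or -1 if it is not per-CPU. */
static int model_rdist_index(uint32_t hwirq)
{
    if (hwirq < MODEL_SGI_PPI_COUNT)
        return (int)hwirq;
    if (hwirq >= MODEL_EPPI_BASE_INTID &&
        hwirq < MODEL_EPPI_BASE_INTID + MODEL_EPPI_COUNT)
        return (int)(hwirq - MODEL_EPPI_BASE_INTID + MODEL_SGI_PPI_COUNT);
    return -1;
}

/* Take one NMI reference on a per-CPU interrupt; returns the new count. */
static unsigned int model_nmi_ref_get(uint32_t hwirq)
{
    int idx = model_rdist_index(hwirq);

    assert(idx >= 0);
    return ++model_nmi_refs[idx];
}
```

Under this layout an SGI such as INTID 3 gets its own slot instead of being out of range, which is the "make room for SGIs" part of the change.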
Co-developed-by: Sumit Garg <sumit.garg@linaro.org>
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Link:
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit a02026bf9da13cd44fb444857d5aebc934e1af5a
git://
BUG=b:197061987
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: I1223c11c88937bd0cbd9b086d4ef216985797302
Reviewed-on:
Reviewed-by: Sean Paul <sean@poorly.run>
Tested-by: Douglas Anderson <dianders@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
M drivers/irqchip/irq-gic-v3.c
ap...@google.com <ap...@google.com> #85
Branch: chromeos-6.1
commit e6e4fe95d4bafe1fb2a546309a0831ea8a91c01e
Author: Douglas Anderson <dianders@chromium.org>
Date: Thu Jun 01 14:31:50 2023
UPSTREAM: kgdb: Provide a stub kgdb_nmicallback() if !CONFIG_KGDB
To save architectures from needing to wrap the call in #ifdefs, add a
stub no-op version of kgdb_nmicallback(), which returns 1 if it didn't
handle anything.
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link:
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
(cherry picked from commit 8117f948f12bc559edf40916e7693512c8c9a50b)
BUG=b:197061987
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: Ia3aeac89bb6751b682237e76e5ba594318e4b1aa
Reviewed-on:
Reviewed-by: Sean Paul <sean@poorly.run>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
M include/linux/kgdb.h
ap...@google.com <ap...@google.com> #86
Branch: chromeos-5.15
commit d3770f2bea1f090ec42bda1483bbaa09ef96dbe1
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed May 10 07:21:36 2023
CHROMIUM: config: Turn on CONFIG_ARM64_PSEUDO_NMI
We want to use this to get backtraces when CPUs are locked up.
This is:
echo 'CONFIG_ARM64_PSEUDO_NMI=y' >> \
chromeos/config/chromeos/arm64/common.config
./chromeos/scripts/kernelconfig olddefconfig
NOTE: this needs to work together with adding
`irqchip.gicv3_pseudo_nmi=1` to the kernel command line.
BUG=b:197061987
TEST=w/ supporting patches can now do backtrace hardlocked CPUs
Change-Id: Ie8c03caba651cfb52eb0af17bfec7610f91fcf44
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Guenter Roeck <groeck@chromium.org>
M chromeos/config/chromeos/arm64/common.config
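To double-check that the config change above actually landed in a generated kernel config, something like the following could be used. This is only a minimal sketch: the helper name `config_has_pseudo_nmi` is made up for illustration and is not part of the ChromeOS tree, and the path you pass in is whatever config file your build produces.

```shell
# Hypothetical helper (name is an assumption, not from the ChromeOS tree):
# succeeds if the kernel config file given as $1 has
# CONFIG_ARM64_PSEUDO_NMI built in.
config_has_pseudo_nmi() {
    # -x matches the whole line, so "# CONFIG_ARM64_PSEUDO_NMI is not set"
    # and longer option names that share the prefix do not count.
    grep -q -x 'CONFIG_ARM64_PSEUDO_NMI=y' "$1"
}

# Example use:
#   config_has_pseudo_nmi path/to/.config && echo "pseudo-NMI compiled in"
```

As the commit message notes, the config option alone is not enough; the kernel command line argument is needed as well.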
ap...@google.com <ap...@google.com> #87
Branch: chromeos-5.15
commit c28b94b561593fed02eb4debef3ca9efba67ac92
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Sep 06 09:03:01 2023
FROMGIT: arm64: kgdb: Implement kgdb_roundup_cpus() to enable pseudo-NMI roundup
Up until now we've been using the generic (weak) implementation for
kgdb_roundup_cpus() when using kgdb on arm64. Let's move to a custom
one. The advantage here is that, when pseudo-NMI is enabled on a
device, we'll be able to round up CPUs using pseudo-NMI. This allows
us to debug CPUs that are stuck with interrupts disabled. If
pseudo-NMIs are not enabled then we'll fall back to just using an IPI,
which is still slightly better than the generic implementation since
it avoids the potential situation described in the generic
kgdb_call_nmi_hook().
Co-developed-by: Sumit Garg <sumit.garg@linaro.org>
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link:
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit 2f5cd0c7ffde0ec7779f27e5c4ed30e131b66393
git://
BUG=b:197061987
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: I2ef26d1b3bfbed2d10a281942b0da7d9854de05e
Reviewed-on:
Reviewed-by: Sean Paul <sean@poorly.run>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
M arch/arm64/kernel/smp.c
ap...@google.com <ap...@google.com> #88
Branch: chromeos-5.15
commit 09382ac2165d24d0e403865af84a35b4bf7318c5
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Sep 06 09:03:00 2023
FROMGIT: arm64: smp: IPI_CPU_STOP and IPI_CPU_CRASH_STOP should try for NMI
There's no reason why IPI_CPU_STOP and IPI_CPU_CRASH_STOP can't be
handled as NMI. They are very simple and everything in them is
NMI-safe. Mark them as things to use NMI for if NMI is available.
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Misono Tomohiro <misono.tomohiro@fujitsu.com>
Reviewed-by: Sumit Garg <sumit.garg@linaro.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link:
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit d7402513c935ad87413b01aa51a7ada0ad2f0163
git://
BUG=b:197061987
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: Ifadbfd45b22c52edcb499034dd4783d096343260
Reviewed-on:
Tested-by: Douglas Anderson <dianders@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
M arch/arm64/kernel/smp.c
ap...@google.com <ap...@google.com> #89
Branch: chromeos-5.15
commit 5cf02c20b7e633a4168ae77c3aefde7020f86658
Author: Mark Rutland <mark.rutland@arm.com>
Date: Mon Oct 02 18:00:36 2023
FROMGIT: arm64: smp: avoid NMI IPIs with broken MediaTek FW
Some MediaTek devices have broken firmware which corrupts some GICR
registers behind the back of the OS, and pseudo-NMIs cannot be used on
these devices. For more details see commit:
44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues")
We did not take this problem into account in commit:
331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
Since that commit arm64's SMP code will try to set up some IPIs as
pseudo-NMIs, even on systems with broken FW. The GICv3 code will
(rightly) reject attempts to request interrupts as pseudo-NMIs,
resulting in boot-time failures.
Avoid the problem by taking the broken FW into account when deciding to
request IPIs as pseudo-NMIs. The GICv3 driver maintains a static_key
named "supports_pseudo_nmis" which is false on systems with broken FW,
and we can consult this within ipi_should_be_nmi().
Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
Reported-by: Chen-Yu Tsai <wenst@chromium.org>
Closes:
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Tested-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit a07a594152173a3dd3bdd12fc7d73dbba54cdbca
git://
BUG=b:197061987
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: I695adbdda561ab9476f4b6d3732a920c7b345579
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Sean Paul <sean@poorly.run>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
M arch/arm64/kernel/smp.c
M drivers/irqchip/irq-gic-v3.c
ap...@google.com <ap...@google.com> #90
Branch: chromeos-5.15
commit 76f262d75f003e8b80eb9d4443e63d91aededd8d
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Sep 06 09:02:59 2023
BACKPORT: FROMGIT: arm64: smp: Add arch support for backtrace using pseudo-NMI
Enable arch_trigger_cpumask_backtrace() support on arm64. This enables
things much like they are enabled on arm32 (including some of the
funky logic around NR_IPI, nr_ipi, and MAX_IPI) but with the
difference that, unlike arm32, we'll try to enable the backtrace to
use pseudo-NMI.
NOTE: this patch is a squash of the little bit of code adding the
ability to mark an IPI to try to use pseudo-NMI plus the little bit of
code to hook things up for kgdb. This approach was decided upon in the
discussion of v9 [1].
This patch depends on commit 8d539b84f1e3 ("nmi_backtrace: allow
excluding an arbitrary CPU") since that commit changed the prototype
of arch_trigger_cpumask_backtrace(), which this patch implements.
[1]
Co-developed-by: Sumit Garg <sumit.garg@linaro.org>
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Co-developed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Misono Tomohiro <misono.tomohiro@fujitsu.com>
Tested-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link:
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit 331a1b3a836c0f38165dcec168c0a03b93cf0c17
git://
Conflicts:
arch/arm64/include/asm/irq.h
arch/arm64/kernel/smp.c
...technically they didn't cause a conflict with the "git cherry-pick",
but those files needed to be manually modified because we don't have
commit 8d539b84f1e3 ("nmi_backtrace: allow excluding an arbitrary
CPU"). Picking that gets into picking the whole pile of cleanup
patches relating to the upstream of the buddy hardlockup
detector. That's a lot of work for the 5.15 kernel when the downstream
detector works fine there. This backport is trivial (just change
"exclude_cpu" to "exclude_self").
BUG=b:197061987
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Cq-Depend: chromium:4919746
Change-Id: Ie6c132b96ebbbcddbf6954b9469ed40a6960343c
Reviewed-on:
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
M arch/arm64/include/asm/irq.h
M arch/arm64/kernel/smp.c
ap...@google.com <ap...@google.com> #91
Branch: chromeos-5.15
commit 54c3dde1afd966db4d996c2b12b8f28e759fbf01
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Oct 02 09:45:30 2023
FROMGIT: arm64: smp: Don't directly call arch_smp_send_reschedule() for wakeup
In commit 2b2d0a7a96ab ("arm64: smp: Remove dedicated wakeup IPI") we
started using a scheduler IPI to avoid a dedicated reschedule. When we
did this, we used arch_smp_send_reschedule() directly rather than
calling smp_send_reschedule(). The only difference is that calling
arch_smp_send_reschedule() directly avoids tracing. Presumably we
_don't_ want to avoid tracing here, so switch to
smp_send_reschedule().
Fixes: 2b2d0a7a96ab ("arm64: smp: Remove dedicated wakeup IPI")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit ef31b8ce313eaf891bf705d5db754e549351816f
git://
BUG=b:197061987
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: I020dc32d3ec71532a9f7a461e6ed97a48df767b6
Reviewed-on:
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
M arch/arm64/kernel/smp.c
ap...@google.com <ap...@google.com> #92
Branch: chromeos-5.15
commit fa7b1519c325a1be6a4e6348df9e80f06597b9f4
Author: Mark Rutland <mark.rutland@arm.com>
Date: Wed Sep 06 09:02:58 2023
FROMGIT: arm64: smp: Remove dedicated wakeup IPI
To enable NMI backtrace and KGDB's NMI cpu roundup, we need to free up
at least one dedicated IPI.
On arm64 the IPI_WAKEUP IPI is only used for the ACPI parking protocol,
which itself is only used on some very early ARMv8 systems which
couldn't implement PSCI.
Remove the IPI_WAKEUP IPI, and rely on the IPI_RESCHEDULE IPI to wake
CPUs from the parked state. This will cause a tiny amount of redundant
work to check the thread flags, but this is minuscule in relation to the
cost of taking and handling the IPI in the first place. We can safely
handle redundant IPI_RESCHEDULE IPIs, so there should be no functional
impact as a result of this change.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Sumit Garg <sumit.garg@linaro.org>
Tested-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link:
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit 2b2d0a7a96ab36ed6d963e29b6211b184ef81596
git://
BUG=b:197061987
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Cq-Depend: chromium:4919747
Change-Id: I7209db47ef8ec151d3de61f59005bbc59fe8f113
Reviewed-on:
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
M arch/arm64/include/asm/smp.h
M arch/arm64/kernel/acpi_parking_protocol.c
M arch/arm64/kernel/smp.c
ap...@google.com <ap...@google.com> #93
Branch: chromeos-5.15
commit 89b2e3d5c3b43ed5f82cc0688f7287c35b4796f1
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Sep 06 09:02:57 2023
FROMGIT: arm64: idle: Tag the arm64 idle functions as __cpuidle
As per the (somewhat recent) comment before the definition of
`__cpuidle`, the tag is like `noinstr` but also marks a function so it
can be identified by cpu_in_idle(). Let's add these markings to arm64
cpuidle functions.
With this change we get useful backtraces like:
NMI backtrace for cpu N skipped: idling at cpu_do_idle+0x94/0x98
instead of useless backtraces when dumping all processors using
nmi_cpu_backtrace().
NOTE: this patch won't make cpu_in_idle() work perfectly for arm64,
but it doesn't hurt and does catch some cases. Specifically an example
that wasn't caught in my testing looked like this:
gic_cpu_sys_reg_init+0x1f8/0x314
gic_cpu_pm_notifier+0x40/0x78
raw_notifier_call_chain+0x5c/0x134
cpu_pm_notify+0x38/0x64
cpu_pm_exit+0x20/0x2c
psci_enter_idle_state+0x48/0x70
cpuidle_enter_state+0xb8/0x260
cpuidle_enter+0x44/0x5c
do_idle+0x188/0x30c
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Acked-by: Sumit Garg <sumit.garg@linaro.org>
Tested-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link:
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit d0c14a7d36f035aeae1bdd6f4afc6488400ed5cf
git://
BUG=b:197061987
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: I4baba13e220bdd24d11400c67f137c35f07f82c7
Reviewed-on:
Tested-by: Douglas Anderson <dianders@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
M arch/arm64/kernel/idle.c
ap...@google.com <ap...@google.com> #94
Branch: chromeos-5.15
commit 043d4e4f9cfa6b4a290a95a7b0b5c1ba3721903b
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Sep 06 09:02:56 2023
FROMGIT: irqchip/gic-v3: Enable support for SGIs to act as NMIs
As of commit 6abbd6988971 ("irqchip/gic, gic-v3: Make SGIs use
handle_percpu_devid_irq()") SGIs are treated the same as PPIs/EPPIs
and use handle_percpu_devid_irq() by default. Unfortunately,
handle_percpu_devid_irq() isn't NMI safe, and so to run in an NMI
context those should use handle_percpu_devid_fasteoi_nmi().
In order to accomplish this, we just have to make room for SGIs in the
array of refcounts that keeps track of which interrupts are set as
NMI. We also rename the array and create a new indexing scheme that
accounts for SGIs.
Also, enable NMI support prior to gic_smp_init() as allocation of SGIs
as IRQs/NMIs happen as part of this routine.
Co-developed-by: Sumit Garg <sumit.garg@linaro.org>
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Link:
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit a02026bf9da13cd44fb444857d5aebc934e1af5a
git://
BUG=b:197061987
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: I1223c11c88937bd0cbd9b086d4ef216985797302
Reviewed-on:
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
M drivers/irqchip/irq-gic-v3.c
ap...@google.com <ap...@google.com> #95
Branch: chromeos-5.15
commit c420546df25a63d9cb3a0acba11fc4afba3fd01f
Author: Douglas Anderson <dianders@chromium.org>
Date: Thu Jun 01 14:31:50 2023
UPSTREAM: kgdb: Provide a stub kgdb_nmicallback() if !CONFIG_KGDB
To save architectures from needing to wrap the call in #ifdefs, add a
stub no-op version of kgdb_nmicallback(), which returns 1 if it didn't
handle anything.
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link:
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
(cherry picked from commit 8117f948f12bc559edf40916e7693512c8c9a50b)
BUG=b:197061987
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: Ia3aeac89bb6751b682237e76e5ba594318e4b1aa
Reviewed-on:
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
M include/linux/kgdb.h
di...@google.com <di...@google.com> #96
So I think we're all set to enable this, at least on boards running chromeos-5.15+ that have GICv3 and don't have the firmware bug. If anyone thinks we need additional discussion then please yell.
di...@google.com <di...@google.com> #97
FWIW, just to summarize where we are on the performance front:
- We believe that the 4% number seen in hackbench is something close-ish to a worst case. AKA: we don't think any benchmarks would regress too much more than that.
- We believe, but don't know for sure, that macrobenchmarks won't regress in a measurable way. In other places (like the LTO analysis) where hackbench was affected by ~4% we didn't believe it translated into macrobenchmarks.
So the idea is that we could land this and then keep an eye on things. We'll keep an eye on various metrics. If we notice something suddenly regress then we'll pull the brakes on this. If something is not an obvious regression or is more minor, it's possible it could make it into beta/stable. We believe the only way something could make it that far would be if it was very minor. We could still revert but it wouldn't be a huge deal if there was a tiny (1% or 2%) regression for a milestone.
This is similar for stability. If we notice crashes go up then we can always pull the plug on this.
bu...@google.com <bu...@google.com>
ap...@google.com <ap...@google.com> #98
Branch: main
commit f84911264ae6373ef73368b4b248b7837f5a3266
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Nov 01 13:38:58 2023
config: Add info about the Interrupt controller to the HardwareFeatures
At the moment all we need to know is whether the interrupt controller
can support sending NMIs (non maskable interrupts).
NOTE TO SELF: the only file to look at is
`proto/chromiumos/config/api/topology.proto`. Everything else is
generated despite only one of them being under the `generated`
directory.
BUG=b:197061987
TEST=./generate.sh, run_go_unittests.sh, run_py_unittests.sh
Change-Id: Ic680e11590d3f4efa387dafa1144ffd26c17bb78
Reviewed-on:
Reviewed-by: Seewai Fu <seewaifu@google.com>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Cindy Lin <xcl@google.com>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
M generated/descriptors.json
M go/api/topology.pb.go
M proto/chromiumos/config/api/topology.proto
M python/chromiumos/config/api/topology_pb2.py
ap...@google.com <ap...@google.com> #99
Branch: chromeos-5.10
commit 2402f349c8523afddb44cf74f826c3f2a382731f
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 17 17:25:44 2023
FIXUP: CHROMIUM: hardlockup: detect hard lockups without NMIs using secondary cpus (all backtrace)
The buddy hard lockup detector should try backtracing on all
CPUs. Right now it doesn't. Copy that bit of logic from the normal
hardlockup detector.
NOTE: On arm64 (the current user of the buddy detector), this won't
(yet) do anything. Soon, hopefully.
UPSTREAM-TASK=b:172213097
BUG=b:278598383, b:278594093, b:197061987, b:172213097
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: Id225408348d8a45e68080d08139bc6d9e170000a
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-on:
Reviewed-by: Guenter Roeck <groeck@chromium.org>
M kernel/watchdog_buddy_cpu.c
ap...@google.com <ap...@google.com> #100
Branch: chromeos-5.4
commit 28e15645f7f4a7be395d1ea547849792ea4a1339
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 17 17:25:44 2023
FIXUP: CHROMIUM: hardlockup: detect hard lockups without NMIs using secondary cpus (all backtrace)
The buddy hard lockup detector should try backtracing on all
CPUs. Right now it doesn't. Copy that bit of logic from the normal
hardlockup detector.
NOTE: On arm64 (the current user of the buddy detector), this won't
(yet) do anything. Soon, hopefully.
UPSTREAM-TASK=b:172213097
BUG=b:278598383, b:278594093, b:197061987, b:172213097
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: Id225408348d8a45e68080d08139bc6d9e170000a
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-on:
Reviewed-by: Guenter Roeck <groeck@chromium.org>
M kernel/watchdog_buddy_cpu.c
ap...@google.com <ap...@google.com> #101
Branch: main
commit d1688b45a1c759755814f7363b683146ef92dad6
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Nov 01 15:34:01 2023
tast: Add hwdep for whether a device supports NMIs
All non-ARM boards should support NMIs. We also support NMIs on arm64
boards that don't have GICv2 and don't have a firmware bug preventing
us from using NMIs.
BUG=b:197061987
TEST=./fast_build.sh -T
TEST=test against a variety of boards
Cq-Depend: chromium:4997602
Change-Id: I6126025cf60b5c60cefe7b79c945166a817362bc
Reviewed-on:
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Seewai Fu <seewaifu@google.com>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
M src/
M src/
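The rule in the hwdep commit above can be written out as a small predicate. This is a rough sketch, not the actual Tast code (which is not shown in this thread): the function name, argument order, and the strings `gicv2` and `yes` are all made up for illustration. The commit message says nothing about 32-bit ARM, so treating it as unsupported here is an assumption.

```shell
# Hypothetical predicate: does the hardware support NMIs?
#   $1 = architecture (e.g. x86_64, arm64, arm)
#   $2 = interrupt controller (e.g. gicv2, gicv3); ignored for non-ARM
#   $3 = "yes" if the board has the firmware bug that breaks pseudo-NMI
hw_supports_nmi() {
    case "$1" in
        arm64)
            # arm64 needs something other than GICv2 and no firmware bug.
            [ "$2" != "gicv2" ] && [ "$3" != "yes" ]
            ;;
        arm)
            # Assumption: 32-bit ARM is not covered by the commit message.
            return 1
            ;;
        *)
            # All non-ARM boards are expected to support NMIs.
            return 0
            ;;
    esac
}
```

For example, `hw_supports_nmi arm64 gicv3 no` succeeds, while `hw_supports_nmi arm64 gicv2 no` and `hw_supports_nmi arm64 gicv3 yes` both fail.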
ap...@google.com <ap...@google.com> #102
Branch: main
commit 7b60c373068e945c1acb9854a22ac3f9d4be3ea2
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Nov 01 09:54:23 2023
tast: Add swdep for whether the OS supports NMI backtrace
The Linux kernel supports NMI backtrace if either:
- We're on an x86 board
- We're on an arm64 board with a new enough kernel
Instead of saying "x86", though, we'll say non-ARM. If,
hypothetically, someone made ChromeOS support MIPS, RISC-V, PowerPC,
or whatever then we'd want to assume that the OS supports NMI
backtrace until we've made an explicit decision that it can't be
supported.
Note that on arm64 boards OS support isn't enough. We also need to
make sure that the interrupt controller supports NMI backtrace and
there aren't any firmware quirks that break NMI backtrace. That will
be detected with a corresponding hwdep.
BUG=b:197061987
TEST=./fast_build.sh -T
TEST=test against a variety of boards
Change-Id: Id0fbeb15c4eb068942354915967cffc4edced42c
Reviewed-on:
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Seewai Fu <seewaifu@google.com>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
M docs/test_dependencies.md
M src/
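The swdep decision above boils down to an architecture check plus a kernel version comparison. A minimal sketch follows, assuming the kernel release string looks like the output of `uname -r` (for example `5.15.120-...`). The function names are hypothetical, the minimum version is left as a parameter because the commit message only says "new enough", and treating 32-bit ARM as unsupported is an assumption.

```shell
# Hypothetical helper: succeeds if kernel release $1 (e.g. "6.1.55-foo")
# is at least major $2, minor $3.
kernel_at_least() {
    _major=${1%%.*}            # text before the first dot
    _rest=${1#*.}              # text after the first dot
    _minor=${_rest%%[!0-9]*}   # leading digits of what follows
    [ "$_major" -gt "$2" ] ||
        { [ "$_major" -eq "$2" ] && [ "$_minor" -ge "$3" ]; }
}

# Hypothetical predicate: does the OS support NMI backtrace?
#   $1 = architecture, $2 = kernel release, $3/$4 = minimum major/minor
os_supports_nmi_backtrace() {
    case "$1" in
        arm64) kernel_at_least "$2" "$3" "$4" ;;
        arm)   return 1 ;;   # assumption: not covered by the commit message
        *)     return 0 ;;   # non-ARM: assume supported until decided otherwise
    esac
}

# Example use (the 5.15 threshold is illustrative):
#   os_supports_nmi_backtrace arm64 "$(uname -r)" 5 15 && echo supported
```

Keeping the threshold as a parameter means a later change to the minimum kernel version only touches the caller.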
ap...@google.com <ap...@google.com> #103
Branch: main
commit d31f987da2244b593f8d6f6990f30627aca56cdc
Author: Douglas Anderson <dianders@chromium.org>
Date: Tue Oct 31 12:22:17 2023
tast: Allow rebooting a DUT with a command other than "reboot"
There are reasons in a tast test to reboot a DUT with a command other
than "reboot". One example is a tast test that wants to test how the
kernel responds to certain kinds of crashes. Some types of kernel
crashes can be simulated by writing a specific string to
"/sys/kernel/debug/provoke-crash/DIRECT". Writing to this file will
reboot the DUT.
Let's avoid having to duplicate all the logic for handling the DUT
rebooting by abstracting out the "reboot" command.
BUG=b:197061987
TEST=./fast_build.sh -T
TEST=tast test simulating a hardlockup works with this
Change-Id: I0556e7364a8342d719d0cb69b83325435bdb1a02
Reviewed-on:
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Auto-Submit: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Seewai Fu <seewaifu@google.com>
Tested-by: Douglas Anderson <dianders@chromium.org>
M src/
ap...@google.com <ap...@google.com> #104
Branch: chromeos-5.4
commit 1b415cbdc4d2258cb37610c4a66039ebffdec9f6
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 17 17:15:31 2023
FIXUP: CHROMIUM: hardlockup: detect hard lockups without NMIs using secondary cpus (sysctl)
The CHROMIUM patch accidentally didn't expose the hardlockup panic
sysctls based on the right config. Fix it.
NOTE: Only one of these two sysctls actually does something with the
current buddy detector. You can turn on/off the hard lockup detector
but it doesn't (yet) support tracing other CPUs.
UPSTREAM-TASK=b:172213097
BUG=b:278598383, b:278594093, b:197061987, b:172213097
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: Id88d1fb603308e7210c30e42bb6e4e6a4be65a0c
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
(cherry picked from commit 71986679fe52d94286f9051f09b958ecf582c7fc)
Reviewed-on:
Commit-Queue: Guenter Roeck <groeck@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
M kernel/sysctl.c
ap...@google.com <ap...@google.com> #105
Branch: chromeos-5.10
commit 9c9acd59e87ef7423808dfc5b4109732a05f78db
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 17 17:15:31 2023
FIXUP: CHROMIUM: hardlockup: detect hard lockups without NMIs using secondary cpus (sysctl)
The CHROMIUM patch accidentally didn't expose the hardlockup panic
sysctls based on the right config. Fix it.
NOTE: Only one of these two sysctls actually does something with the
current buddy detector. You can turn on/off the hard lockup detector
but it doesn't (yet) support tracing other CPUs.
UPSTREAM-TASK=b:172213097
BUG=b:278598383, b:278594093, b:197061987, b:172213097
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Change-Id: Id88d1fb603308e7210c30e42bb6e4e6a4be65a0c
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on:
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
(cherry picked from commit 71986679fe52d94286f9051f09b958ecf582c7fc)
Reviewed-on:
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Commit-Queue: Guenter Roeck <groeck@chromium.org>
M kernel/sysctl.c
ap...@google.com <ap...@google.com> #106
Branch: main
commit 0c9410bcd960fdb5ad68892473be1d65902a9a9d
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Oct 09 14:55:24 2023
arm boards: Set `irqchip.gicv3_pseudo_nmi=1` on the kernel command line
In order to get pseudo-NMI enabled, we not only have to enable the
kernel config `CONFIG_ARM64_PSEUDO_NMI` but we _also_ have to set a
kernel command line.
The reason for needing both the config and command line is that, for a
long time, pseudo-NMI was considered unstable and also had big
performance implications. The idea was that a Linux distribution could
enable pseudo-NMI support in the kernel without exposing everyone to
instability and performance issues. Individual users who wanted
pseudo-NMI could then turn it on via the kernel command line and
accept the downsides.
Today, pseudo-NMI still has _some_ performance issues but they are
fairly minor. The small decrease in some benchmarks is justified by
the fact that we'll be able to gather much better traces of locked up
systems.
Today, pseudo-NMI appears to be quite stable. Though Mark Rutland at
ARM still expresses some worries [1] [2], we are not aware of any
concrete problems and believe the feature to be stable.
[1]
[2]
A few notes:
- Though this looks like it's turning the config on for every ARM
board, it actually isn't. We've only turned on the kernel config for
chromeos-5.15+ since we haven't backported all the support to
previous kernels. All boards are listed here so that they can get
enabled whenever they uprev.
- For some Mediatek boards, pseudo-NMI won't even be enabled after
they uprev to a newer kernel despite this kernel command line
argument and the `CONFIG_ARM64_PSEUDO_NMI` kernel config. This is
because many older Mediatek boards have a bug in their
firmware. Those boards are still listed here so that when we uprev
the firmware of those boards then pseudo-NMI will get enabled. See
- For oak/elm (mt8173) we don't bother enabling this because those
boards don't have GICv3 and thus this command line option doesn't
make sense there.
BUG=b:197061987
TEST=pseudo-NMI works on trogdor and mtk boards w/ new firmware
Cq-Depend: chrome-internal:6542779
Change-Id: I99b47c2b5ea97bd7b0d99857824723f49a0a1b60
Reviewed-on:
Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Brian Norris <briannorris@chromium.org>
M baseboard-asurada/scripts/build_kernel_image.sh
M baseboard-cherry/scripts/build_kernel_image.sh
M baseboard-corsola/scripts/build_kernel_image.sh
M baseboard-geralt/scripts/build_kernel_image.sh
M baseboard-kukui/scripts/build_kernel_image.sh
M baseboard-trogdor/scripts/build_kernel_image.sh
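On a running device, whether the argument actually made it onto the kernel command line can be checked from `/proc/cmdline`. A minimal sketch, assuming the literal `=1` spelling used in this change; the function name is hypothetical, and any other spellings the kernel might accept are deliberately not handled.

```shell
# Hypothetical helper: succeeds if the kernel command line passed as $1
# contains the exact argument added by this change.
cmdline_requests_pseudo_nmi() {
    # Pad with spaces so the argument only matches as a whole word.
    case " $1 " in
        *" irqchip.gicv3_pseudo_nmi=1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use on a device:
#   cmdline_requests_pseudo_nmi "$(cat /proc/cmdline)" && echo requested
```

Note that this only shows pseudo-NMI was requested. As described above, it still won't be enabled without `CONFIG_ARM64_PSEUDO_NMI`, on boards without GICv3, or on boards with the firmware bug.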
di...@google.com <di...@google.com> #107
Leaving this open for now to get tests landed.
ap...@google.com <ap...@google.com> #108
Branch: main
commit dec6c228466d4a90b0e67797bba6cfacc71b1802
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Jan 17 13:03:24 2024
tast: Update swdep for whether a device supports testing NMI backtrace
As talked about in
to run for making sure that we collect hardlockups isn't 100% reliable
on kernels before v6.1. Update the swdep so we only test on kernel
6.1+.
BUG=b:197061987, b:309669058
TEST=crash.KernelCrash.hard_lockup
TEST=./fast_build.sh -T
Change-Id: Ia88081933a73a528497fa19c1f865f62e5f59883
Reviewed-on:
Reviewed-by: Seewai Fu <seewaifu@google.com>
Reviewed-by: Miriam Zimmerman <mutexlox@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
M src/
ap...@google.com <ap...@google.com> #109
Branch: main
commit 1633215adfda635cb42eb461e4765bea7104db2b
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Dec 20 13:35:32 2023
crash: Make sure we capture hardlockups properly
Add a test to make sure that we're capturing hardlockups properly and
generating a good signature.
NOTE: as per comments in the test, if we cause a hardlockup on one CPU
it can plausibly cause a hardlockup on another CPU. We'll be a bit
lenient on the signature checking but we'll make sure that we at least
saw the expected function somewhere in the crash report.
BUG=b:197061987, b:309669058
TEST=crash.KernelCrash.hard_lockup
Change-Id: I19c3c78f9c634b5f5842c6e62582386f1a5cfa31
Reviewed-on:
Tast-Review: Katherine Threlkeld <kathrelkeld@chromium.org>
Reviewed-by: Nancy Zhao <zhaon@google.com>
Reviewed-by: Miriam Zimmerman <mutexlox@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
M src/
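The lenient check described above amounts to searching the whole crash report for the expected function instead of requiring it in the signature itself. A rough sketch, not the actual Tast test code; the helper name is invented for illustration.

```shell
# Hypothetical helper: succeeds if the crash report text ($1) mentions
# the expected function name ($2) anywhere.
report_mentions_function() {
    # -F treats the name as a fixed string rather than a regex.
    printf '%s\n' "$1" | grep -q -F -- "$2"
}

# Example use:
#   report_mentions_function "$(cat report.txt)" "some_function" || echo "missing"
```

This tolerates the case where a second CPU locks up and ends up in the signature, as long as the function that was expected still shows up somewhere in the report.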
ap...@google.com <ap...@google.com> #110
Branch: main
commit 8941ce1cb489cb5c821384bf08238d8c193a75ee
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Nov 22 12:18:03 2023
crash: Better checking of the crash signature
In preparation for checking the signature for other types of kernel
crashes besides panic, let's allow each test to specify a signature to
match. Let's also clean up the signature matching code so that the
error message reports the signature that was found as well as the
regex that was used.
BUG=b:197061987
TEST=crash.KernelCrash.mock_consent
Change-Id: I0ca903e76f74d6286feddbfea3a4524c797190f1
Reviewed-on:
Tast-Review: Katherine Threlkeld <kathrelkeld@chromium.org>
Reviewed-by: Miriam Zimmerman <mutexlox@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
M src/
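A per-test signature check with the kind of error message this commit describes might be sketched as follows. The helper name `checkSignature` and the signature format in the example are assumptions made for illustration; the real tast implementation may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// checkSignature compares the signature found in a crash report against the
// regex a test expects. On mismatch, the error names both the signature that
// was found and the regex that was used. Hypothetical helper.
func checkSignature(found, pattern string) error {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return fmt.Errorf("bad signature regex %q: %v", pattern, err)
	}
	if !re.MatchString(found) {
		return fmt.Errorf("unexpected signature: got %q, want match for %q", found, pattern)
	}
	return nil
}

func main() {
	// Made-up signature format, for illustration only.
	pattern := `^kernel-example_panic-[0-9A-F]{8}$`
	fmt.Println(checkSignature("kernel-example_panic-1A2B3C4D", pattern))
	fmt.Println(checkSignature("kernel-something_else-00000000", pattern))
}
```

Passing the pattern per call is what lets each test specify its own expected signature.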
ap...@google.com <ap...@google.com> #111
Branch: main
commit 6fe85ea2184c36207c7c52b7cfee4d1230811c68
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Nov 22 08:38:06 2023
crash: Everything has "provoke-crash" now, no need for /proc/breakme
Kernels have all had the "provoke-crash" logic for a long time
now. There's no reason for the extra fallback to /proc/breakme.
BUG=b:197061987
TEST=crash.KernelCrash.mock_consent
Change-Id: Ife00fa291c88966ec8c1691e67f30f93a9f3fbe7
Reviewed-on:
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Tast-Review: Katherine Threlkeld <kathrelkeld@chromium.org>
Reviewed-by: Miriam Zimmerman <mutexlox@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
M src/
ap...@google.com <ap...@google.com> #112
Branch: main
commit 4dfb6cc7ffe03163d5187d87332e79a1014721f9
Author: Douglas Anderson <dianders@chromium.org>
Date: Thu Jan 18 09:45:07 2024
crash: Fix incorrect use of cleanupCtx
When getting the meta file, don't use the cleanupCtx, which is only
supposed to be used to clean up after errors.
BUG=b:197061987
TEST=crash.KernelCrash.mock_consent
Change-Id: I22ac24d3e9a08afbb13e723c92bcf0ab5c826527
Reviewed-on:
Reviewed-by: Ian Barkley-Yeung <iby@chromium.org>
Tast-Review: Katherine Threlkeld <kathrelkeld@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
M src/
di...@google.com <di...@google.com> #113
Despite the long list of "pending" CLs that somehow didn't get updated, I think we can close this. We've got arm64 NMI backtraces backported to 5.15+, and the patches to clean up some of the output in corner cases where two lockups happen at once backported to 6.1+. Tast testing is enabled for 6.1+.
Current status:
- trogdor boards are on 5.15 and thus have this enabled. I checked several benchmarks and TPS metrics and didn't see any noticeable performance regression here, and nobody has brought anything to my attention. I'll still keep my ears open just in case.
- older MediaTek boards (elm/hana) have GICv2, which doesn't support pseudo-NMI, so those aren't supported.
- many MediaTek boards had firmware fixes landed (see b/281831288), but those firmwares haven't necessarily gone through qual and rolled out. You'd have to check any specific board.
- currently pseudo-NMI is temporarily disabled on geralt due to a bug. It's promised that this will be re-enabled soon.
Closing this.
ap...@google.com <ap...@google.com> #114
Branch: chromeos-5.15
commit 28a8baee37cbfa9b7eecea05ca4e0f4381cf39b1
Author: Linux Patches Robot <linux-patches-robot@chromeos-missing-patches.google.com.iam.gserviceaccount.com>
Date: Wed Jul 17 01:38:17 2024
UPSTREAM: arm64: smp: Fix missing IPI statistics
commit 83cfac95c018 ("genirq: Allow interrupts to be excluded from
/proc/interrupts") is meant to keep IPIs from appearing twice in
/proc/interrupts. But commit 331a1b3a836c ("arm64: smp: Add arch support
for backtrace using pseudo-NMI") and commit 2f5cd0c7ffde ("arm64: kgdb:
Implement kgdb_roundup_cpus() to enable pseudo-NMI roundup") set the
"IRQ_HIDDEN" flag on the CPU_BACKTRACE and KGDB_ROUNDUP IPIs but do not
show them in arch_show_interrupts(), so their kstat_irqs accounting is
missing from the display.
Before this patch, CPU_BACKTRACE and KGDB_ROUNDUP IPIs are missing:
/ # cat /proc/interrupts
        CPU0   CPU1   CPU2   CPU3
  11:    466    600    309    332  GICv3         27 Level arch_timer
  13:     24      0      0      0  GICv3         33 Level uart-pl011
  15:     64      0      0      0  GICv3         78 Edge  virtio0
  16:      0      0      0      0  GICv3         79 Edge  virtio1
  17:      0      0      0      0  GICv3         34 Level rtc-pl031
  18:      3      3      3      3  GICv3         23 Level arm-pmu
  19:      0      0      0      0  9030000.pl061  3 Edge  GPIO Key Poweroff
IPI0:      7     14      9     26  Rescheduling interrupts
IPI1:    354     93    233    255  Function call interrupts
IPI2:      0      0      0      0  CPU stop interrupts
IPI3:      0      0      0      0  CPU stop (for crash dump) interrupts
IPI4:      0      0      0      0  Timer broadcast interrupts
IPI5:      1      0      0      0  IRQ work interrupts
 Err:      0
After this patch, CPU_BACKTRACE and KGDB_ROUNDUP IPIs are displayed:
/ # cat /proc/interrupts
        CPU0   CPU1   CPU2   CPU3
  11:    393    281    532    449  GICv3         27 Level arch_timer
  13:     15      0      0      0  GICv3         33 Level uart-pl011
  15:     64      0      0      0  GICv3         78 Edge  virtio0
  16:      0      0      0      0  GICv3         79 Edge  virtio1
  17:      0      0      0      0  GICv3         34 Level rtc-pl031
  18:      2      2      2      2  GICv3         23 Level arm-pmu
  19:      0      0      0      0  9030000.pl061  3 Edge  GPIO Key Poweroff
IPI0:     11     19      4     23  Rescheduling interrupts
IPI1:    279    347    222     72  Function call interrupts
IPI2:      0      0      0      0  CPU stop interrupts
IPI3:      0      0      0      0  CPU stop (for crash dump) interrupts
IPI4:      0      0      0      0  Timer broadcast interrupts
IPI5:      1      0      0      1  IRQ work interrupts
IPI6:      0      0      0      0  CPU backtrace interrupts
IPI7:      0      0      0      0  KGDB roundup interrupts
 Err:      0
Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Suggested-by: Doug Anderson <dianders@chromium.org>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link:
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit 916b93f4e865b35563902f5862b443fc122631b4)
BUG=b:197061987
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Signed-off-by: Linux Patches Robot <linux-patches-robot@chromeos-missing-patches.google.com.iam.gserviceaccount.com>
Change-Id: I6d1c06167fecb8b9a42debef96508efc6e985a11
Reviewed-on:
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
M arch/arm64/kernel/smp.c
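One way to confirm that a kernel carries this fix is to look for the two new IPI rows in /proc/interrupts. The sketch below parses the text format shown in the commit message; `ipiDescriptions` is a hypothetical helper, and it assumes the header line lists exactly the CPUs that have a count column in each row.

```go
package main

import (
	"fmt"
	"strings"
)

// ipiDescriptions returns the description of every IPI row in the given
// /proc/interrupts text, keyed by row label (for example "IPI6").
// The CPU count is taken from the header line ("CPU0 CPU1 ...").
func ipiDescriptions(text string) map[string]string {
	out := make(map[string]string)
	lines := strings.Split(strings.TrimSpace(text), "\n")
	numCPUs := len(strings.Fields(lines[0]))
	for _, line := range lines[1:] {
		fields := strings.Fields(line)
		// An IPI row is "<label>:", one count per CPU, then the description.
		if len(fields) <= numCPUs || !strings.HasPrefix(fields[0], "IPI") {
			continue
		}
		label := strings.TrimSuffix(fields[0], ":")
		out[label] = strings.Join(fields[1+numCPUs:], " ")
	}
	return out
}

func main() {
	// Rows copied from the "after" output in the commit message.
	sample := `CPU0 CPU1 CPU2 CPU3
IPI5: 1 0 0 1 IRQ work interrupts
IPI6: 0 0 0 0 CPU backtrace interrupts
IPI7: 0 0 0 0 KGDB roundup interrupts
Err: 0`
	ipis := ipiDescriptions(sample)
	fmt.Println("backtrace row:", ipis["IPI6"])
	fmt.Println("kgdb row:", ipis["IPI7"])
}
```

On a real device the input would come from reading /proc/interrupts; the IPI numbering may differ between kernel versions, so matching on the description text is likely more robust than matching on the label.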
ap...@google.com <ap...@google.com> #115
Branch: chromeos-6.1
commit dc1ffc90b404f792aafd1a8046fb0808531bee67
Author: Linux Patches Robot <linux-patches-robot@chromeos-missing-patches.google.com.iam.gserviceaccount.com>
Date: Thu Jul 18 01:41:26 2024
UPSTREAM: arm64: smp: Fix missing IPI statistics
commit 83cfac95c018 ("genirq: Allow interrupts to be excluded from
/proc/interrupts") is meant to keep IPIs from appearing twice in
/proc/interrupts. But commit 331a1b3a836c ("arm64: smp: Add arch support
for backtrace using pseudo-NMI") and commit 2f5cd0c7ffde ("arm64: kgdb:
Implement kgdb_roundup_cpus() to enable pseudo-NMI roundup") set the
"IRQ_HIDDEN" flag on the CPU_BACKTRACE and KGDB_ROUNDUP IPIs but do not
show them in arch_show_interrupts(), so their kstat_irqs accounting is
missing from the display.
Before this patch, CPU_BACKTRACE and KGDB_ROUNDUP IPIs are missing:
/ # cat /proc/interrupts
        CPU0   CPU1   CPU2   CPU3
  11:    466    600    309    332  GICv3         27 Level arch_timer
  13:     24      0      0      0  GICv3         33 Level uart-pl011
  15:     64      0      0      0  GICv3         78 Edge  virtio0
  16:      0      0      0      0  GICv3         79 Edge  virtio1
  17:      0      0      0      0  GICv3         34 Level rtc-pl031
  18:      3      3      3      3  GICv3         23 Level arm-pmu
  19:      0      0      0      0  9030000.pl061  3 Edge  GPIO Key Poweroff
IPI0:      7     14      9     26  Rescheduling interrupts
IPI1:    354     93    233    255  Function call interrupts
IPI2:      0      0      0      0  CPU stop interrupts
IPI3:      0      0      0      0  CPU stop (for crash dump) interrupts
IPI4:      0      0      0      0  Timer broadcast interrupts
IPI5:      1      0      0      0  IRQ work interrupts
 Err:      0
After this patch, CPU_BACKTRACE and KGDB_ROUNDUP IPIs are displayed:
/ # cat /proc/interrupts
        CPU0   CPU1   CPU2   CPU3
  11:    393    281    532    449  GICv3         27 Level arch_timer
  13:     15      0      0      0  GICv3         33 Level uart-pl011
  15:     64      0      0      0  GICv3         78 Edge  virtio0
  16:      0      0      0      0  GICv3         79 Edge  virtio1
  17:      0      0      0      0  GICv3         34 Level rtc-pl031
  18:      2      2      2      2  GICv3         23 Level arm-pmu
  19:      0      0      0      0  9030000.pl061  3 Edge  GPIO Key Poweroff
IPI0:     11     19      4     23  Rescheduling interrupts
IPI1:    279    347    222     72  Function call interrupts
IPI2:      0      0      0      0  CPU stop interrupts
IPI3:      0      0      0      0  CPU stop (for crash dump) interrupts
IPI4:      0      0      0      0  Timer broadcast interrupts
IPI5:      1      0      0      1  IRQ work interrupts
IPI6:      0      0      0      0  CPU backtrace interrupts
IPI7:      0      0      0      0  KGDB roundup interrupts
 Err:      0
Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Suggested-by: Doug Anderson <dianders@chromium.org>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link:
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(cherry picked from commit 916b93f4e865b35563902f5862b443fc122631b4)
BUG=b:197061987
TEST=echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
Signed-off-by: Linux Patches Robot <linux-patches-robot@chromeos-missing-patches.google.com.iam.gserviceaccount.com>
Change-Id: I6d1c06167fecb8b9a42debef96508efc6e985a11
Reviewed-on:
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
M arch/arm64/kernel/smp.c
Description
As mentioned in bug 172213097, the buddy lockup detector does not provide meaningful debug information, i.e. stack dumps of the locked-up core.
While a true hardlockup detector such as the one proposed in bug 172228850 would provide such information, not all platforms can support it. The platforms left out would still end up using the buddy lockup detector, or none at all.
A recent series [1] that uses pseudo-NMIs for IPIs on arm64 makes it possible to get NMI backtraces from hard-locked-up cores. This was tested on RK3399 on mainline [2] with:
Subsequently when an RCU stall is detected, an NMI backtrace is logged:
SysRq backtraces also work properly.
Without pseudo-NMIs enabled, only a task dump is logged:
And SysRq backtraces aren't available for all cores.
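Whether pseudo-NMIs were requested at boot can be checked from the kernel command line. The sketch below looks for the `irqchip.gicv3_pseudo_nmi=1` parameter mentioned in the comments on this bug. `pseudoNMIRequested` is a hypothetical helper, and treating the last occurrence as the one that counts is an assumption of this sketch. A positive result only means the option was requested; the kernel must also be built with ARM64_PSEUDO_NMI and run on a GICv3 for it to take effect.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// pseudoNMIRequested reports whether the given kernel command line asks for
// pseudo-NMIs. If the parameter appears more than once, the last occurrence
// is used. Only the literal value "1" is treated as enabled in this sketch.
func pseudoNMIRequested(cmdline string) bool {
	const param = "irqchip.gicv3_pseudo_nmi="
	requested := false
	for _, arg := range strings.Fields(cmdline) {
		if strings.HasPrefix(arg, param) {
			requested = strings.TrimPrefix(arg, param) == "1"
		}
	}
	return requested
}

func main() {
	data, err := os.ReadFile("/proc/cmdline")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read /proc/cmdline:", err)
		os.Exit(1)
	}
	fmt.Println("pseudo-NMI requested:", pseudoNMIRequested(string(data)))
}
```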
[1] https://lore.kernel.org/linux-arm-kernel/1604317487-14543-1-git-send-email-sumit.garg@linaro.org/
[2] https://lore.kernel.org/linux-arm-kernel/CAGb2v66mVoWiCibjq25d3Z8OvbWNO9p+vMo761RJLiD-BqVbqw@mail.gmail.com/