display clks stuck at off due to broken rcg clk parking logic in the qualcomm clk driver [319956935]

Fixed

Bug

KernelCI CrOS issues

Status Update

No update yet.

Description

sw...@google.com

created issue #1

Jan 13, 2024 04:47AM

The rcg clk parking logic in the qualcomm clk driver is broken. The original parking logic was written in commit 7ef6f11887bd ("clk: qcom: Configure the RCGs to a safe source as needed")[1] and then rewritten in commit 703db1f5da1e ("clk: qcom: rcg2: Cache CFG register updates for parked RCGs")[2]. The original implementation's flaw was that it left the RCG (root clock generator) dirty with the frequency that the clk framework had set on the clk instead of the safe frequency (XO speed). This caused two problems. Confusion when debugging because the register contents didn't match what the clk was actually programmed for and a real problem once the GDSC itself started hitting the RCG's "update" bit when powering on. Once the GDSC started trying to enable the clk by clearing the dirty state it caused clks to get stuck because the clk registers would be dirty with a configuration that selected a parent that's off (e.g. the display PLL).

The rewritten patch fixed this by caching the configuration register contents in software instead of in hardware. Moving the intended register setting out of the register fixed the problem because the GDSC enabling would only cause XO to be selected, and thus no stuck clk could occur. There's one problem in the patch though. The cached configuration register needs to be populated when the clk is registered, otherwise the contents are all zero, meaning source from XO. If an RCG is sourcing something besides XO (which we are on Trogdor) it will be switched over to XO when the clk is enabled the first time.

This design is still broken though because of another issue. The GDSC may need to enable more than one RCG to do "housekeeping" while turning on. In the case of display clks, the display GDSC enables more than one clk. If all the clks haven't been parked and some parent has been disabled then the GDSC is going to lock up and not actually turn on. This will lead to clks that otherwise have been parked not turning on properly. This is very confusing.

I diagnosed this problem in this thread. Here's what happens:

mdp and rot rcgs are sourcing from disp_cc_pll0 at driver probe, disp_cc_pll0 is enabled, mdp rcg/branch and rot rcg/branch are enabled
mdp rcg is configured to restore on XO (because of the wrong cached register)
rot rcg is configured to restore on XO (because of the wrong cached register)
mdp rcg is switched to XO on clk_enable()
mdp rcg is switched to XO on clk_disable()
disp_cc_pll0 is left untouched
rot branch clk is disabled during late init (disp_cc_mdss_rot_clk)
rot rcg is still enabled
disp_cc_pll0 is disabled during late init
mdp rcg is switched to XO on clk_enable()
mdp branch is stuck off

[    0.511452] disp_cc_mdss_mdp_clk status stuck at 'off'
[    0.511475] WARNING: CPU: 1 PID: 81 at drivers/clk/qcom/clk-branch.c:87 clk_branch_toggle+0x114/0x168
[    0.511498] Modules linked in:
[    0.511508] CPU: 1 PID: 81 Comm: kworker/u16:4 Not tainted 6.7.0-g0dd3ee311255 #1 f5757d475795053fd2ad52247a070cd50dd046f2
[    0.511519] Hardware name: Google Lazor (rev1 - 2) with LTE (DT)
[    0.511526] Workqueue: events_unbound deferred_probe_work_func
[    0.511540] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    0.511549] pc : clk_branch_toggle+0x114/0x168
[    0.511558] lr : clk_branch_toggle+0x110/0x168
[    0.511567] sp : ffffffc08084b670
[    0.511573] pmr_save: 00000060
[    0.511578] x29: ffffffc08084b680 x28: ffffff808006de00 x27: 0000000000000001
[    0.511591] x26: ffffff8080dbd4f4 x25: 0000000000000000 x24: 0000000000000000
[    0.511604] x23: 0000000000000000 x22: ffffffd838461198 x21: ffffffd838007997
[    0.511616] x20: ffffffd837541d5c x19: 0000000000000001 x18: 0000000000000004
[    0.511628] x17: 0000000000000000 x16: 0000000000000010 x15: ffffffd837070fac
[    0.511640] x14: 0000000000000003 x13: 0000000000000004 x12: 0000000000000001
[    0.511651] x11: c0000000ffffdfff x10: ffffffd838347aa0 x9 : 08dadf92e516c000
[    0.511663] x8 : 08dadf92e516c000 x7 : 0000000000000000 x6 : 0000000000000027
[    0.511675] x5 : ffffffd8385a61f2 x4 : 0000000000000000 x3 : ffffffc08084b398
[    0.511687] x2 : ffffffc08084b3a0 x1 : 00000000ffffdfff x0 : 00000000fffffff0
[    0.511699] Call trace:
[    0.511705]  clk_branch_toggle+0x114/0x168
[    0.511715]  clk_branch2_enable+0x24/0x30
[    0.511725]  clk_core_enable+0x5c/0x1c8
[    0.511735]  clk_enable+0x38/0x58
[    0.511744]  clk_bulk_enable+0x40/0xb0
[    0.511753]  mdss_runtime_resume+0x68/0x258
[    0.511764]  pm_generic_runtime_resume+0x30/0x44
[    0.511775]  __genpd_runtime_resume+0x30/0x80
[    0.511786]  genpd_runtime_resume+0x124/0x214
[    0.511795]  __rpm_callback+0x7c/0x15c
[    0.511805]  rpm_callback+0x30/0x88
[    0.511813]  rpm_resume+0x390/0x4d8
[    0.511823]  rpm_resume+0x43c/0x4d8
[    0.511831]  __pm_runtime_resume+0x54/0x98
[    0.511839]  __device_attach+0xe0/0x170
[    0.511847]  device_initial_probe+0x1c/0x28
[    0.511854]  bus_probe_device+0x48/0xa4
[    0.511864]  device_add+0x52c/0x6fc
[    0.511872]  mipi_dsi_device_register_full+0x104/0x1a8
[    0.511882]  devm_mipi_dsi_device_register_full+0x28/0x78
[    0.511891]  ti_sn_bridge_probe+0x1dc/0x2bc
[    0.511900]  auxiliary_bus_probe+0x4c/0x94
[    0.511909]  really_probe+0xf8/0x270
[    0.511917]  __driver_probe_device+0xa8/0x130
[    0.511925]  driver_probe_device+0x44/0x104
[    0.511934]  __device_attach_driver+0xa4/0xcc
[    0.511941]  bus_for_each_drv+0x94/0xe8
[    0.511951]  __device_attach+0xf8/0x170
[    0.511959]  device_initial_probe+0x1c/0x28
[    0.511967]  bus_probe_device+0x48/0xa4
[    0.511976]  deferred_probe_work_func+0x9c/0xd8
[    0.511984]  process_scheduled_works+0x1ac/0x4dc
[    0.511995]  worker_thread+0x110/0x320
[    0.512004]  kthread+0x100/0x120
[    0.512013]  ret_from_fork+0x10/0x20
[    0.512023] ---[ end trace 0000000000000000 ]---

More specifically, the way we disable clks during late init and how we skip reading or writing the rcg enable bit causes clks to get stuck when the GDSC is turned on after disabling of unused clks. The RCG has a register bit that turns on the RCG. The branches which are children of that RCG automatically turn on the RCG when they are enabled or disabled themselves. There's a feedback mechanism in hardware that enables the RCG when a child branch is enabled. Given this design, the linux clk driver never writes the RCG enable bit unless it needs to ensure the RCG is on regardless of child branch state. The parking logic writes the rcg enable bit while parking the clk to ensure the parking actually works. Otherwise the parking itself may get stuck because both sources need to be clocking, along with the rcg itself, to ensure a proper switch.

When we disable unused clks, the rot branch is disabled but the rot rcg is not because it doesn't implement struct clk_ops::is_enabled(). Enabling the GDSC after late init wedges all the clks for the display subsystem (mdp and rot) because the GDSC turns on the clks temporarily and gets stuck when the rot rcg is sourcing from the display pll that is disabled.

Another patch series has attempted to implement struct clk_ops::is_enabled() for parked RCGs. The downside of this approach is that it relies on the parent being XO to know that the clk is disabled and therefore XO has to be removed as a possible parent. This makes linking up parents confusing if the RCG is parented to XO during registration. The patch moves the parent to the first frequency in the frequency table during registration to avoid confusing the is_enabled() clk op.

The root cause of the problem is that the enable state of the clk tree is not being carried over properly from the bootloader. The disable unused clk code disables the child of the rcg, but it doesn't know about the parent being enabled and thus the rot rcg is never parked. To make matters worse, if the disable unused logic never runs we'll still have a problem because the display driver turns off the display pll (disp_cc_pll0), causing the rot clk to be stuck and any future attempts to enable the GDSC will wedge the mdp clk until the display pll is enabled (because rot has always been sourcing from the display pll).

A proper fix would probably be to teach the GDSC logic to park RCGs when the GDSC is powered down. This may run into a problem though where the display PLL is disabled before the GDSC is, and then the RCG can't be parked anymore. Another solution is to park all RCGs that are parkable at clk registration time and also clear the RCG enable bit (root enable bit). If we did this we may ruin continuous splash screen, but it would allow the GDSC to be enabled/disabled regardless of clk framework state.

Comments

di...@google.com <di...@google.com> Jan 22, 2024 07:54PM

Assigned to sw...@google.com.

dm...@linaro.org <dm...@linaro.org> #2Jan 23, 2024 12:50AM

I think parking the clocks might cause issues not only to the cont splash screen (which we do not support anyway), but to EFIFB / simplefb.

sw...@google.com <sw...@google.com> #3Jan 26, 2024 11:41PM

There are two problems that have clear solutions:

The parent of parkable RCGs is always assumed to be XO at registration time even when that isn't true
The rcg root enable bit is set in the bootloader and so disabling the child branch doesn't turn off the rcg as expected

We can fix these problems by clearing the root enable bit and reading the rcg at clk registration time so that the parent is in sync and the root enable bit is cleared.

This still leaves us with the problem that the RCG isn't parked when a child branch is disabled during late init. To fix that we can simply mark those branches as CLK_IGNORE_UNUSED and let the display driver eventually probe and turn the clks off.

sw...@google.com <sw...@google.com> #4Feb 24, 2024 06:31PM

This is a problem with the chromeos-5.15 kernel now.

[    0.508609] disp_cc_mdss_rot_clk status stuck at 'on'
[    0.508621] WARNING: CPU: 7 PID: 1 at drivers/clk/qcom/clk-branch.c:92 clk_branch_toggle+0x124/0x16c
[    0.508628] Modules linked in:
[    0.508633] CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.15.148-21808-gf0ad352e8fc4 #1 648f4578b1491a64607b0febe095e018a7a58716
[    0.508637] Hardware name: Google CoachZ (rev1 - 2) (DT)
[    0.508640] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    0.508643] pc : clk_branch_toggle+0x124/0x16c
[    0.508646] lr : clk_branch_toggle+0x120/0x16c

sw...@google.com <sw...@google.com> #5Mar 26, 2024 10:26PM

Accepted by sw...@google.com.

I've got a two patch series to fix this on sc7180. It implements number 1 and 2 bullet points from comment #3 and then adds CLK_IGNORE_UNUSED to disp_cc_pll0 and disp_cc_mdss_rot_clk so that the rot branch and display pll can't be turned off during late init when the gdsc is already disabled.

sw...@google.com <sw...@google.com> #6Mar 27, 2024 12:07AM

I have another idea, which is to implement a struct clk_ops::disable_unused clk_op that we would set on a parkable rcg branch so that when the branch is found to be enabled during disable_unused, we go up to the parent rcg and park it on XO before disabling the branch. This may solve it better because we don't need to set the CLK_IGNORE_UNUSED flag. But if clk_ignore_unused is on the kernel commandline we'll never run this code to park the rcg on XO, and we still have the problem where the display pll (disp_cc_pll0) may be disabled because the enable_count of the PLL went to zero while disp_cc_mdss_rot_clk is enabled and sourcing from disp_cc_pll0.

In fact, I see that disp_cc_pll0 is disabled and enabled many times before disabling unused clks, so it must be causing disp_cc_mdss_rot_clk to get stuck on after fixing the parent reported for disp_cc_mdss_rot_clk and disp_cc_mdss_mdp_clk to be disp_cc_pll0. Maybe we should manually disable disp_cc_mdss_rot_clk at display driver probe to avoid this scenario. We don't care about this clk being on through boot on sc7180 so this seems acceptable.

TL;DR: the best fix is to implement parking from the gdsc disable path. Doing so will require some register spinlock and pointers to the rcgs that the gdsc manages. And when the gdsc is turned on, we can skip restoring anything because all clks should be disabled when a gdsc is disabled. The corner case is handoff from boot, where qcom,sc7180-mdss compatible node is the device that is runtime resuming and suspending, causing the display pll to be disabled and the gdsc to be turned off, when it really needs to not do that until the child nodes have all probed, e.g. qcom,sc7180-dpu compatible node, and those child devices enable and disable clks like DISP_CC_MDSS_ROT_CLK.

sw...@google.com <sw...@google.com> #7Mar 27, 2024 12:20AM

In fact, I see that disp_cc_pll0 is disabled and enabled many times before disabling unused clks

This also points out that the "qcom,sc7180-mdss" driver is runtime suspending before child devices like "qcom,sc7180-dpu" probe. That sounds wrong.

db...@gmail.com <db...@gmail.com> #8Mar 27, 2024 01:52AM

Now as DSI/DSI PHY and DP have proper runtime-PM status, maybe we should add device links from mdss to its children, keeping it alive? I hoped that it would be enough to propagate status of the children to the MDSS device, but from your text it seems that it is not enough.

sw...@google.com <sw...@google.com> #9Mar 27, 2024 02:34AM

I hoped that it would be enough to propagate status of the children to the MDSS device

I'm still tracking down where mdss first gets turned off. I think device links duplicate what parent child relationships do with runtime PM, so I hope they're not necessary. Can we keep mdss runtime PM active until DPU probes and enables the rotator clk?

sw...@google.com <sw...@google.com> #10Mar 27, 2024 02:39AM

I put some printks in

[    0.303093] sboyd: enabling runtime pm on mdss
[    0.305025] sboyd: disabling mdss_gdsc
[    0.366780] sboyd: disabling mdss_gdsc
[    0.369592] sboyd: disabling mdss_gdsc
[    0.512409] sboyd: disabling mdss_gdsc
[    0.519450] sboyd: disabling mdss_gdsc
[    0.521547] [drm:dpu_kms_init:1196] sboyd: enabling runtime pm on dpu
[    0.521560] [drm:dpu_kms_hw_init:1042] sboyd: runtime resume and get ae01000.display-controller
[    0.522558] [drm:dpu_kms_hw_init:1160] sboyd: runtime put sync ae01000.display-controller
[    0.522821] sboyd: disabling mdss_gdsc

all those disabling mdss_gdsc lines between mdss and dpu are unexpected.

sw...@google.com <sw...@google.com> #11Mar 27, 2024 03:07AM

The first rpm_resume() is from of_platform_populate() where __driver_probe_device() calls pm_runtime_get_sync(dev->parent). I have this dump_stack()

[    0.291929] CPU: 6 PID: 11 Comm: kworker/u16:0 Not tainted 6.8.0-00005-g0e2cb559efb7-dirty #1 8bb035a8a74abeafcde1984f53c37e9e728e208b
[    0.291934] Hardware name: Google Lazor (rev3 - 8) with KB Backlight (DT)
[    0.291938] Workqueue: events_unbound deferred_probe_work_func
[    0.291953] Call trace:
[    0.291956]  dump_backtrace+0xf0/0x10c
[    0.291963]  show_stack+0x20/0x2c
[    0.291967]  dump_stack_lvl+0x84/0xac
[    0.291971]  dump_stack+0x18/0x30
[    0.291974]  rpm_resume+0x334/0x5ac
[    0.291979]  __pm_runtime_resume+0x54/0x98
[    0.291984]  __device_attach+0xe0/0x170
[    0.291989]  device_initial_probe+0x1c/0x28
[    0.291993]  bus_probe_device+0x48/0xa4
[    0.291997]  device_add+0x210/0x3cc
[    0.292001]  of_device_add+0x40/0x50
[    0.292006]  of_platform_device_create_pdata+0x9c/0x118
[    0.292010]  of_platform_bus_create+0x1c8/0x370
[    0.292015]  of_platform_populate+0x7c/0xe4
[    0.292019]  mdss_probe+0x260/0x364

and then it increments another time for pm_runtime_get_suppliers(dev) because fw_devlink decided that the parent is supplying something. Then it looks like __driver_probe_device() calls pm_runtime_put(dev->parent) and pm_runtime_put_suppliers(dev) and that runtime suspends mdss. We can probably put a pm_runtime_get() call around of_platform_populate() to simplify this, but if the dpu driver isn't probing because it isn't builtin or it isn't created as a runtime active device I don't know how we keep mdss out of runtime suspend. I'll look at usb for clues.

sw...@google.com <sw...@google.com> #12Mar 27, 2024 03:40AM

The doc has some info:

However, if the device has a parent and the parent’s runtime PM is enabled, calling
pm_runtime_set_active() for the device will affect the parent, unless the parent’s
‘power.ignore_children’ flag is set. Namely, in that case the parent won’t be able 
to suspend at run time, using the PM core’s helper functions, as long as the child’s 
status is ‘active’, even if the child’s runtime PM is still disabled (i.e. 
pm_runtime_enable() hasn’t been called for the child yet or pm_runtime_disable() 
has been called for it). For this reason, once pm_runtime_set_active() has been 
called for the device, pm_runtime_enable() should be called for it too as soon as 
reasonably possible or its runtime PM status should be changed back to ‘suspended’ 
with the help of pm_runtime_set_suspended().

so maybe we should probe mdss, runtime_pm_get() mdss, set all child devices to active, and then put the runtime pm reference on mdss. If a child is active and not probed yet then I think it will keep mdss runtime active, and thus mdss_gdsc will stay enabled as well.

Message last modified on Mar 27, 2024 03:40AM

sw...@google.com <sw...@google.com> #13Mar 27, 2024 07:00AM

I applied this patch but I ran into an SError at boot on lazor.

diff --git a/drivers/gpu/drm/msm/msm_mdss.c b/drivers/gpu/drm/msm/msm_mdss.c
index 35423d10aafa..106f51de917d 100644
--- a/drivers/gpu/drm/msm/msm_mdss.c
+++ b/drivers/gpu/drm/msm/msm_mdss.c
@@ -484,6 +484,12 @@ static int mdss_probe(struct platform_device *pdev)
 
        platform_set_drvdata(pdev, mdss);
 
+       ret = pm_runtime_resume_and_get(dev);
+       if (ret) {
+               msm_mdss_destroy(mdss);
+               return ret;
+       }
+
        /*
         * MDP5/DPU based devices don't have a flat hierarchy. There is a top
         * level parent: MDSS, and children: MDP5/DPU, DSI, HDMI, eDP etc.
@@ -494,10 +500,11 @@ static int mdss_probe(struct platform_device *pdev)
        if (ret) {
                DRM_DEV_ERROR(dev, "failed to populate children devices\n");
                msm_mdss_destroy(mdss);
-               return ret;
        }
 
-       return 0;
+       pm_runtime_put(dev);
+
+       return ret;
 }
 
 static void mdss_remove(struct platform_device *pdev)
diff --git a/drivers/of/platform.c b/drivers/of/platform.c
index b7708a06dc78..7f8b9d538748 100644
--- a/drivers/of/platform.c
+++ b/drivers/of/platform.c
@@ -20,6 +20,7 @@
 #include <linux/of_irq.h>
 #include <linux/of_platform.h>
 #include <linux/platform_device.h>
+#include <linux/pm_runtime.h>
 #include <linux/sysfb.h>
 
 #include "of_private.h"
@@ -181,6 +182,14 @@ static struct platform_device *of_platform_device_create_pdata(
        dev->dev.platform_data = platform_data;
        of_msi_configure(&dev->dev, dev->dev.of_node);
 
+       /*
+        * Initialize devices populated by runtime PM active devices as runtime
+        * PM active as well so that the parent can't runtime suspend until all
+        * active children are runtime suspended.
+        */
+       if (parent && pm_runtime_active(parent))
+               pm_runtime_set_active(&dev->dev);
+
        if (of_device_add(dev) != 0) {
                platform_device_put(dev);
                goto err_clear_flag;

It looks like it's coming from the DP driver.

[    0.528263] msm_dpu ae01000.display-controller: bound ae94000.dsi (ops 0xffffffe24abe1fe0)
[    0.528521] SError Interrupt on CPU7, code 0x00000000be000411 -- SError
[    0.528524] CPU: 7 PID: 122 Comm: kworker/7:25 Tainted: G        W          6.8.0-00005-g0e2cb559efb7-dirty #1 99a8035332e11cd1ac2ef7f9244e5239b4958e1f
[    0.528526] Hardware name: Google Lazor (rev3 - 8) with KB Backlight (DT)
[    0.528527] Workqueue: pm pm_runtime_work
[    0.528531] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    0.528534] pc : mutex_unlock+0x28/0x60
[    0.528537] lr : dp_aux_deinit+0x40/0x50
[    0.528541] sp : ffffffc080d3bbc0
[    0.528542] pmr_save: 000000e0
[    0.528543] x29: ffffffc080d3bbc0 x28: 00000000fffffff5 x27: ffffffc080d3bcf8
[    0.528546] x26: 0000000000000002 x25: 0000000000000000 x24: 0000000000000000
[    0.528549] x23: ffffff80832faae0 x22: ffffffe24a3c4c6c x21: 0000000000000000
[    0.528551] x20: ffffff8082565890 x19: ffffff80825658e8 x18: 0000000000000000
[    0.528554] x17: 0000000000000000 x16: 0000000000000007 x15: ffffffe24b11aa70
[    0.528557] x14: 000000000000007c x13: 0000000000c0000e x12: 0000000000f0000f
[    0.528559] x11: 00000000510f8040 x10: 0000000000000000 x9 : ffffff808251da00
[    0.528562] x8 : ffffff808251da00 x7 : 0000000000000000 x6 : 672e38462d4bb1a6
[    0.528564] x5 : 0000000000000001 x4 : 000000001f8066e6 x3 : 0000000000000010
[    0.528567] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff8082565890
[    0.528570] Kernel panic - not syncing: Asynchronous SError Interrupt
[    0.528571] CPU: 7 PID: 122 Comm: kworker/7:25 Tainted: G        W          6.8.0-00005-g0e2cb559efb7-dirty #1 99a8035332e11cd1ac2ef7f9244e5239b4958e1f
[    0.528574] Hardware name: Google Lazor (rev3 - 8) with KB Backlight (DT)
[    0.528574] Workqueue: pm pm_runtime_work
[    0.528577] Call trace:
[    0.528578]  dump_backtrace+0xf0/0x10c
[    0.528580]  show_stack+0x20/0x2c
[    0.528582]  dump_stack_lvl+0x84/0xac
[    0.528584]  dump_stack+0x18/0x30
[    0.528585]  panic+0x13c/0x348
[    0.528589]  nmi_panic+0x4c/0x90
[    0.528590]  arm64_serror_panic+0x70/0x7c
[    0.528593]  is_valid_bugaddr+0x0/0x10
[    0.528595]  el1h_64_error_handler+0x34/0x48
[    0.528596]  el1h_64_error+0x7c/0x80
[    0.528598]  mutex_unlock+0x28/0x60
[    0.528600]  dp_aux_deinit+0x40/0x50
[    0.528603]  dp_pm_runtime_suspend+0x84/0xa4
[    0.528605]  pm_generic_runtime_suspend+0x30/0x44
[    0.528607]  genpd_runtime_suspend+0xfc/0x220
[    0.528611]  __rpm_callback+0x84/0x138
[    0.528614]  rpm_callback+0x30/0x88
[    0.528616]  rpm_suspend+0x168/0x520
[    0.528619]  pm_runtime_work+0x84/0xa8
[    0.528622]  process_scheduled_works+0x1ac/0x3f4
[    0.528625]  worker_thread+0x208/0x31c
[    0.528628]  kthread+0x100/0x1b8
[    0.528630]  ret_from_fork+0x10/0x20
[    0.528633] SMP: stopping secondary CPUs
[    0.528727] Kernel Offset: 0x21c9e00000 from 0xffffffc080000000
[    0.528729] PHYS_OFFSET: 0x80000000
[    0.528730] CPU features: 0x0,80000061,5002814a,e100720b
[    0.528733] Memory Limit: none

I also see an error with geni so maybe child devices that are of_platform_populate()d need more care if the parent is runtime active before populating.

[    0.259996] gcc_qupv3_wrap1_s0_clk already disabled
[    0.260026] WARNING: CPU: 0 PID: 10 at drivers/clk/clk.c:1178 clk_core_disable+0x90/0x1b0
[    0.260049] Modules linked in:
[    0.260060] CPU: 0 PID: 10 Comm: kworker/0:1 Not tainted 6.8.0-00005-g0e2cb559efb7-dirty #1 9d3d495907444b6453520dc7dde1f9993743cd6d
[    0.260072] Hardware name: Google Lazor (rev3 - 8) with KB Backlight (DT)
[    0.260079] Workqueue: pm pm_runtime_work
[    0.260096] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    0.260106] pc : clk_core_disable+0x90/0x1b0
[    0.260117] lr : clk_core_disable+0x90/0x1b0
[    0.260127] sp : ffffffc0800a3b60
[    0.260134] pmr_save: 00000060
[    0.260139] x29: ffffffc0800a3b60 x28: 00000000fffffff5 x27: ffffffc0800a3cf8
[    0.260154] x26: 0000000000000002 x25: 0000000000000000 x24: 0000000000000000
[    0.260167] x23: ffffff8082547de0 x22: ffffffe0563c4c48 x21: 0000000000000000
[    0.260181] x20: 00000000000000e0 x19: ffffff8081de3a00 x18: ffffffe056aeceb4
[    0.260194] x17: 0000000000000000 x16: 00000000000000c0 x15: ffffffe056ab6fd4
[    0.260208] x14: 0000000000000003 x13: 0000000000000004 x12: 0000000000000001
[    0.260221] x11: c0000000ffffdfff x10: ffffffe057167e20 x9 : 852f8c4ebbcfe700
[    0.260234] x8 : 852f8c4ebbcfe700 x7 : 0000000000000000 x6 : 322e30202020205b
[    0.260247] x5 : ffffffe0573fb193 x4 : 0000000000000000 x3 : ffffffc0800a3888
[    0.260261] x2 : ffffffc0800a3890 x1 : 00000000ffffdfff x0 : 0000000000000027
[    0.260275] Call trace:
[    0.260281]  clk_core_disable+0x90/0x1b0
[    0.260293]  clk_disable+0x38/0x50
[    0.260302]  geni_se_clks_off+0x24/0x5c
[    0.260314]  geni_se_resources_off+0x34/0x48
[    0.260325]  spi_geni_runtime_suspend+0x30/0x4c
[    0.260337]  pm_generic_runtime_suspend+0x30/0x44
[    0.260346]  genpd_runtime_suspend+0xfc/0x220
[    0.260356]  __rpm_callback+0x84/0x138
[    0.260365]  rpm_callback+0x30/0x88
[    0.260375]  rpm_suspend+0x168/0x520
[    0.260384]  pm_runtime_work+0x84/0xa8
[    0.260393]  process_scheduled_works+0x1ac/0x3f4
[    0.260405]  worker_thread+0x208/0x31c
[    0.260415]  kthread+0x100/0x1b8
[    0.260426]  ret_from_fork+0x10/0x20

ah yeah, pm_runtime_active() also returns true for devices that don't have runtime PM enabled for them, so weird. Here's a patch that uses pm_runtime_status_suspended() and also moves the DP device back to being suspended.

diff --git a/drivers/gpu/drm/msm/dp/dp_display.c b/drivers/gpu/drm/msm/dp/dp_display.c
index 4c72124ffb5d..c1e01208d87a 100644
--- a/drivers/gpu/drm/msm/dp/dp_display.c
+++ b/drivers/gpu/drm/msm/dp/dp_display.c
@@ -1306,6 +1306,7 @@ static int dp_display_probe(struct platform_device *pdev)
 
        platform_set_drvdata(pdev, &dp->dp_display);
 
+       pm_runtime_set_suspended(&pdev->dev);
        rc = devm_pm_runtime_enable(&pdev->dev);
        if (rc)
                goto err;
diff --git a/drivers/gpu/drm/msm/msm_mdss.c b/drivers/gpu/drm/msm/msm_mdss.c
index 35423d10aafa..106f51de917d 100644
--- a/drivers/gpu/drm/msm/msm_mdss.c
+++ b/drivers/gpu/drm/msm/msm_mdss.c
@@ -484,6 +484,12 @@ static int mdss_probe(struct platform_device *pdev)
 
        platform_set_drvdata(pdev, mdss);
 
+       ret = pm_runtime_resume_and_get(dev);
+       if (ret) {
+               msm_mdss_destroy(mdss);
+               return ret;
+       }
+
        /*
         * MDP5/DPU based devices don't have a flat hierarchy. There is a top
         * level parent: MDSS, and children: MDP5/DPU, DSI, HDMI, eDP etc.
@@ -494,10 +500,11 @@ static int mdss_probe(struct platform_device *pdev)
        if (ret) {
                DRM_DEV_ERROR(dev, "failed to populate children devices\n");
                msm_mdss_destroy(mdss);
-               return ret;
        }
 
-       return 0;
+       pm_runtime_put(dev);
+
+       return ret;
 }
 
 static void mdss_remove(struct platform_device *pdev)
diff --git a/drivers/of/platform.c b/drivers/of/platform.c
index b7708a06dc78..e22e7d4e7392 100644
--- a/drivers/of/platform.c
+++ b/drivers/of/platform.c
@@ -20,6 +20,7 @@
 #include <linux/of_irq.h>
 #include <linux/of_platform.h>
 #include <linux/platform_device.h>
+#include <linux/pm_runtime.h>
 #include <linux/sysfb.h>
 
 #include "of_private.h"
@@ -181,6 +182,14 @@ static struct platform_device *of_platform_device_create_pdata(
        dev->dev.platform_data = platform_data;
        of_msi_configure(&dev->dev, dev->dev.of_node);
 
+       /*
+        * Initialize devices populated by runtime PM active devices as runtime
+        * PM active as well so that the parent can't runtime suspend until all
+        * active children are runtime suspended.
+        */
+       if (parent && !pm_runtime_status_suspended(parent))
+               pm_runtime_set_active(&dev->dev);
+
        if (of_device_add(dev) != 0) {
                platform_device_put(dev);
                goto err_clear_flag;

This patch seems to work nicely. I don't need to forcibly disable the rot clk if I do this. Maybe I'll make a new of_platform_populate() that pushes through the runtime PM active state so that other drivers have to opt in.

sw...@google.com <sw...@google.com> #14Mar 27, 2024 07:17AM

I get a bunch of these splats though. Marking the device as runtime PM active before probing it causes pm_runtime_get() and friends to skip the runtime resume callback that the driver uses to turn on clks. But it also prevents the mdss_gdsc from being turned off until the child devices probe and enable runtime PM.

[    0.525175] disp_cc_mdss_vsync_clk already disabled
[    0.525202] WARNING: CPU: 1 PID: 13 at drivers/clk/clk.c:1178 clk_core_disable+0x90/0x1b0
[    0.525220] Modules linked in:
[    0.525230] CPU: 1 PID: 13 Comm: kworker/u16:1 Not tainted 6.8.0-00006-g55380d555520-dirty #1 f2a10dccee3a324871e43bb62817fac4a1855722
[    0.525242] Hardware name: Google Lazor (rev3 - 8) with KB Backlight (DT)
[    0.525250] Workqueue: events_unbound deferred_probe_work_func
[    0.525264] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    0.525274] pc : clk_core_disable+0x90/0x1b0
[    0.525283] lr : clk_core_disable+0x90/0x1b0
[    0.525292] sp : ffffffc0800bb630
[    0.525298] pmr_save: 00000060
[    0.525303] x29: ffffffc0800bb630 x28: 00000000fffffff5 x27: ffffffc0800bb7b8
[    0.525317] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[    0.525330] x23: ffffff80823a0a80 x22: ffffffd814fc4c48 x21: ffffff8082ba9080
[    0.525342] x20: 00000000000000e0 x19: ffffff8080d99600 x18: 0000000000000000
[    0.525354] x17: 0000000000000000 x16: 00000000000000c0 x15: ffffffd8156b7094
[    0.525366] x14: 0000000000000003 x13: 0000000000000004 x12: 0000000000000001
[    0.525378] x11: c0000000ffffdfff x10: ffffffd815d67e20 x9 : a1387faf952ba900
[    0.525390] x8 : a1387faf952ba900 x7 : 0000000000000000 x6 : 352e30202020205b
[    0.525402] x5 : ffffffd815ffb193 x4 : 0000000000000000 x3 : ffffffc0800bb358
[    0.525414] x2 : ffffffc0800bb360 x1 : 00000000ffffdfff x0 : 0000000000000027
[    0.525428] Call trace:
[    0.525434]  clk_core_disable+0x90/0x1b0
[    0.525443]  clk_disable+0x38/0x50
[    0.525452]  clk_bulk_disable+0x34/0x4c
[    0.525460]  dpu_runtime_suspend+0x40/0x94
[    0.525472]  pm_generic_runtime_suspend+0x30/0x44
[    0.525482]  genpd_runtime_suspend+0xfc/0x220
[    0.525492]  __rpm_callback+0x84/0x138
[    0.525501]  rpm_callback+0x30/0x88
[    0.525510]  rpm_suspend+0x168/0x520
[    0.525518]  rpm_idle+0x5c/0x2b4
[    0.525527]  __pm_runtime_idle+0x48/0x100
[    0.525535]  dpu_kms_hw_init+0x478/0x49c
[    0.525545]  msm_drm_kms_init+0xb4/0x228
[    0.525554]  msm_drm_bind+0x278/0x3b8
[    0.525562]  try_to_bring_up_aggregate_device+0x16c/0x1b8
[    0.525572]  __component_add+0xb4/0x178
[    0.525581]  component_add+0x1c/0x28
[    0.525590]  dsi_dev_attach+0x24/0x30
[    0.525599]  dsi_host_attach+0xe8/0x11c
[    0.525609]  devm_mipi_dsi_attach+0x40/0xc8
[    0.525618]  ti_sn_bridge_probe+0x248/0x2c0
[    0.525628]  auxiliary_bus_probe+0x4c/0x94
[    0.525636]  really_probe+0xf8/0x270
[    0.525645]  __driver_probe_device+0xa8/0x130
[    0.525654]  driver_probe_device+0x44/0x104
[    0.525664]  __device_attach_driver+0xa4/0xcc
[    0.525673]  bus_for_each_drv+0x94/0xe8
[    0.525681]  __device_attach+0xf8/0x170
[    0.525690]  device_initial_probe+0x1c/0x28
[    0.525698]  bus_probe_device+0x48/0xa4
[    0.525706]  deferred_probe_work_func+0x9c/0xd8
[    0.525714]  process_scheduled_works+0x1ac/0x3f4
[    0.525725]  worker_thread+0x208/0x31c
[    0.525735]  kthread+0x100/0x1b8
[    0.525744]  ret_from_fork+0x10/0x20

sw...@google.com <sw...@google.com> #15Mar 27, 2024 08:27PM

I've realized that trying to make sure the mdss device doesn't runtime suspend until all child devices probe is the sync_state problem. We don't know if the child device drivers will ever probe, and we want the device drivers to probe so they can get their clks and enable them to match the runtime PM state of the device as "active". The problem here is much more concrete than the abstract sync_state problem though. We know what child devices form the display pipeline, and we don't probe the component device until all those display pipeline drivers probe. One idea is to runtime_pm_get() mdss in probe and then put that reference in the component driver probe. Another idea is to move the clk enabling and disabling logic into pm domains that are attached to the devices (not the drivers) in the mdss driver. Doing this would let us power on all those pm_domains and associated clks in mdss probe and power them down together. Or we teach gdsc code to park RCGs when the gdsc is powered down. Once we start reporting the real parent out of boot we'll have a problem where the display pll will be turned off and the gdsc driver will try to park the rcg and it won't be able to because the shared parent (display pll) was disabled by another clk.

sw...@google.com <sw...@google.com> #16Mar 27, 2024 08:28PM

Sent patch series at https://lore.kernel.org/r/20240327202740.3075378-1-swboyd@chromium.org

ro...@google.com <ro...@google.com> #17Mar 27, 2024 08:58PM

This problem comes about because of a combination of issues. The clk framework doesn't handle the case where two clks share the same parent and are enabled at boot. The first clk to enable and disable will turn off the parent.

side note (not to derail the discussion), but this is kinda where I got stuck many years back when I attempted to implement efifb->drm handover (on apq8016, so a simpler clk tree).. I wonder if we could try to get the prepare/enable counts to match hw state when kernel starts? Ie. for each enabled leaf node, walk up the current parents to the root incrementing the prepare+enable counts?

ap...@google.com <ap...@google.com> #18Mar 28, 2024 04:52PM

Project: chromiumos/third_party/kernel
Branch: chromeos-6.6

commit 74d088b425d7b58e71c002e4480c2c0d96ce4e99
Author: Stephen Boyd <swboyd@chromium.org>
Date: Wed Mar 27 13:27:38 2024

FROMLIST: clk: qcom: Properly initialize shared RCGs upon registration

There's two problems with shared RCGs.

The first problem is that they incorrectly report the parent after
commit 703db1f5da1e ("clk: qcom: rcg2: Cache CFG register updates for
parked RCGs"). That's because the cached CFG register value needs to be
populated when the clk is registered. clk_rcg2_shared_enable() writes
the cached CFG register value 'parked_cfg'. This value is initially zero
due to static initializers. If a driver calls clk_enable() first before
setting a rate or parent, it will set the parent to '0' which is
(almost?) always XO, and may not reflect the real parent. In the worst
case, this switches the RCG from sourcing a fast PLL to the slow crystal
speed.

The second problem is that the force enable bit isn't cleared. The force
enable bit is only used during parking and unparking of shared RCGs.
Otherwise it shouldn't be set because it keeps the RCG enabled even when
all the branches on the output of the RCG are disabled (the hardware has
a feedback mechanism so that any child branches keep the RCG enabled
when the branch enable bit is set). This problem wastes power if the clk
is unused, and is harmful in the case that the clk framework disables
the parent of the force enabled RCG. In the latter case, the GDSC the
shared RCG is associated with will get wedged if the RCG's source clk is
disabled and the GDSC tries to enable the RCG to do "housekeeping" while
powering on.

Both of these problems combined with incorrect runtime PM usage in the
display driver lead to a black screen on Qualcomm sc7180 Trogdor
chromebooks. What happens is that the bootloader leaves the
'disp_cc_mdss_rot_clk' enabled and the 'disp_cc_mdss_rot_clk_src' force
enabled and parented to 'disp_cc_pll0'. The mdss driver probes and
runtime suspends, disabling the mdss_gdsc which uses the
'disp_cc_mdss_rot_clk_src' for "housekeeping". The
'disp_cc_mdss_rot_clk' is disabled during late init because the clk is
unused, but the parent 'disp_cc_mdss_rot_clk_src' is still force enabled
because the force enable bit was never cleared. Then 'disp_cc_pll0' is
disabled because it is also unused. That's because the clk framework
believes the parent of the RCG is XO when it isn't. A child device of
the mdss device (e.g. DSI) runtime resumes mdss which powers on the
mdss_gdsc. This wedges the GDSC because 'disp_cc_mdss_rot_clk_src' is
parented to 'disp_cc_pll0' and that PLL is off. With the GDSC wedged,
mdss_runtime_resume() tries to enable 'disp_cc_mdss_mdp_clk' but it
can't because the GDSC has wedged all the clks associated with the GDSC.

disp_cc_mdss_mdp_clk status stuck at 'off'
WARNING: CPU: 1 PID: 81 at drivers/clk/qcom/clk-branch.c:87 clk_branch_toggle+0x114/0x168
Modules linked in:
CPU: 1 PID: 81 Comm: kworker/u16:4 Not tainted 6.7.0-g0dd3ee311255 #1 f5757d475795053fd2ad52247a070cd50dd046f2
Hardware name: Google Lazor (rev1 - 2) with LTE (DT)
Workqueue: events_unbound deferred_probe_work_func
pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : clk_branch_toggle+0x114/0x168
lr : clk_branch_toggle+0x110/0x168
sp : ffffffc08084b670
pmr_save: 00000060
x29: ffffffc08084b680 x28: ffffff808006de00 x27: 0000000000000001
x26: ffffff8080dbd4f4 x25: 0000000000000000 x24: 0000000000000000
x23: 0000000000000000 x22: ffffffd838461198 x21: ffffffd838007997
x20: ffffffd837541d5c x19: 0000000000000001 x18: 0000000000000004
x17: 0000000000000000 x16: 0000000000000010 x15: ffffffd837070fac
x14: 0000000000000003 x13: 0000000000000004 x12: 0000000000000001
x11: c0000000ffffdfff x10: ffffffd838347aa0 x9 : 08dadf92e516c000
x8 : 08dadf92e516c000 x7 : 0000000000000000 x6 : 0000000000000027
x5 : ffffffd8385a61f2 x4 : 0000000000000000 x3 : ffffffc08084b398
x2 : ffffffc08084b3a0 x1 : 00000000ffffdfff x0 : 00000000fffffff0
Call trace:
clk_branch_toggle+0x114/0x168
clk_branch2_enable+0x24/0x30
clk_core_enable+0x5c/0x1c8
clk_enable+0x38/0x58
clk_bulk_enable+0x40/0xb0
mdss_runtime_resume+0x68/0x258
pm_generic_runtime_resume+0x30/0x44
__genpd_runtime_resume+0x30/0x80
genpd_runtime_resume+0x124/0x214
__rpm_callback+0x7c/0x15c
rpm_callback+0x30/0x88
rpm_resume+0x390/0x4d8
rpm_resume+0x43c/0x4d8
__pm_runtime_resume+0x54/0x98
__device_attach+0xe0/0x170
device_initial_probe+0x1c/0x28
bus_probe_device+0x48/0xa4
device_add+0x52c/0x6fc
mipi_dsi_device_register_full+0x104/0x1a8
devm_mipi_dsi_device_register_full+0x28/0x78
ti_sn_bridge_probe+0x1dc/0x2bc
auxiliary_bus_probe+0x4c/0x94
really_probe+0xf8/0x270
__driver_probe_device+0xa8/0x130
driver_probe_device+0x44/0x104
__device_attach_driver+0xa4/0xcc
bus_for_each_drv+0x94/0xe8
__device_attach+0xf8/0x170
device_initial_probe+0x1c/0x28
bus_probe_device+0x48/0xa4
deferred_probe_work_func+0x9c/0xd8
process_scheduled_works+0x1ac/0x4dc
worker_thread+0x110/0x320
kthread+0x100/0x120
ret_from_fork+0x10/0x20

Fixes: 703db1f5da1e ("clk: qcom: rcg2: Cache CFG register updates for parked RCGs")
Reported-by: Stephen Boyd <sboyd@kernel.org>
Closes:

https://lore.kernel.org/r/1290a5a0f7f584fcce722eeb2a1fd898.sboyd@kernel.org
Closes:

https://issuetracker.google.com/319956935
Reported-by: Laura Nao <laura.nao@collabora.com>
Closes:

https://lore.kernel.org/r/20231218091806.7155-1-laura.nao@collabora.com
Cc: Bjorn Andersson <andersson@kernel.org>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Taniya Das <quic_tdas@quicinc.com>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
(am from

https://patchwork.kernel.org/patch/13607408/)
(also found at

https://lore.kernel.org/r/20240327202740.3075378-3-swboyd@chromium.org)

BUG=b:319956935, b:330474631
TEST=Boot on lazor
Recovery flow on lazor
UPSTREAM-TASK=b:331677760

Change-Id: I26dd4defa57c0d415e9eade82e284169a0cc1b76
Reviewed-on:

https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/5401609
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Tested-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Commit-Queue: Stephen Boyd <swboyd@chromium.org>

M drivers/clk/qcom/clk-rcg2.c

https://chromium-review.googlesource.com/5401609

ap...@google.com <ap...@google.com> #19Mar 28, 2024 04:52PM

Project: chromiumos/third_party/kernel
Branch: chromeos-6.6

commit 9e8799af6f0d52d3495e44982dcf85123112aca0
Author: Stephen Boyd <swboyd@chromium.org>
Date: Wed Mar 27 13:27:37 2024

FROMLIST: clk: qcom: dispcc-sc7180: Force off rotator clk at probe

The 'disp_cc_mdss_rot_clk' is parented to 'disp_cc_pll0' and enabled out
of boot on sc7180 Trogdor devices. This is a problem because the mdss
driver (the driver for compatible node "qcom,sc7180-mdss") turns off
'disp_cc_mdss_mdp_clk' and that clk is also parented to 'disp_cc_pll0'.
When the display pll turns off, the rotator clk gets stuck on.

We don't really care about this clk being on through boot, so simply
disable the clk during driver probe to avoid this issue.

Cc: Bjorn Andersson <andersson@kernel.org>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Taniya Das <quic_tdas@quicinc.com>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
(am from

https://patchwork.kernel.org/patch/13607407/)
(also found at

https://lore.kernel.org/r/20240327202740.3075378-2-swboyd@chromium.org)

BUG=b:319956935, b:330474631
TEST=Boot on lazor
Recovery flow on lazor
UPSTREAM-TASK=b:331677760

Change-Id: I08838e7e3691fc1818e3f91c9e72c2d781e3a642
Reviewed-on:

https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/5401608
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Commit-Queue: Stephen Boyd <swboyd@chromium.org>
Commit-Queue: Douglas Anderson <dianders@chromium.org>
Tested-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>

M drivers/clk/qcom/dispcc-sc7180.c

https://chromium-review.googlesource.com/5401608

ap...@google.com <ap...@google.com> #20Mar 29, 2024 06:24PM

Project: chromiumos/third_party/kernel
Branch: release-R124-15823.B-chromeos-6.6

commit f5cce15577634121971841d8eca111baccb99992
Author: Stephen Boyd <swboyd@chromium.org>
Date: Wed Mar 27 13:27:38 2024

FROMLIST: clk: qcom: Properly initialize shared RCGs upon registration

There's two problems with shared RCGs.

The first problem is that they incorrectly report the parent after
commit 703db1f5da1e ("clk: qcom: rcg2: Cache CFG register updates for
parked RCGs"). That's because the cached CFG register value needs to be
populated when the clk is registered. clk_rcg2_shared_enable() writes
the cached CFG register value 'parked_cfg'. This value is initially zero
due to static initializers. If a driver calls clk_enable() first before
setting a rate or parent, it will set the parent to '0' which is
(almost?) always XO, and may not reflect the real parent. In the worst
case, this switches the RCG from sourcing a fast PLL to the slow crystal
speed.

The second problem is that the force enable bit isn't cleared. The force
enable bit is only used during parking and unparking of shared RCGs.
Otherwise it shouldn't be set because it keeps the RCG enabled even when
all the branches on the output of the RCG are disabled (the hardware has
a feedback mechanism so that any child branches keep the RCG enabled
when the branch enable bit is set). This problem wastes power if the clk
is unused, and is harmful in the case that the clk framework disables
the parent of the force enabled RCG. In the latter case, the GDSC the
shared RCG is associated with will get wedged if the RCG's source clk is
disabled and the GDSC tries to enable the RCG to do "housekeeping" while
powering on.

Both of these problems combined with incorrect runtime PM usage in the
display driver lead to a black screen on Qualcomm sc7180 Trogdor
chromebooks. What happens is that the bootloader leaves the
'disp_cc_mdss_rot_clk' enabled and the 'disp_cc_mdss_rot_clk_src' force
enabled and parented to 'disp_cc_pll0'. The mdss driver probes and
runtime suspends, disabling the mdss_gdsc which uses the
'disp_cc_mdss_rot_clk_src' for "housekeeping". The
'disp_cc_mdss_rot_clk' is disabled during late init because the clk is
unused, but the parent 'disp_cc_mdss_rot_clk_src' is still force enabled
because the force enable bit was never cleared. Then 'disp_cc_pll0' is
disabled because it is also unused. That's because the clk framework
believes the parent of the RCG is XO when it isn't. A child device of
the mdss device (e.g. DSI) runtime resumes mdss which powers on the
mdss_gdsc. This wedges the GDSC because 'disp_cc_mdss_rot_clk_src' is
parented to 'disp_cc_pll0' and that PLL is off. With the GDSC wedged,
mdss_runtime_resume() tries to enable 'disp_cc_mdss_mdp_clk' but it
can't because the GDSC has wedged all the clks associated with the GDSC.

disp_cc_mdss_mdp_clk status stuck at 'off'
WARNING: CPU: 1 PID: 81 at drivers/clk/qcom/clk-branch.c:87 clk_branch_toggle+0x114/0x168
Modules linked in:
CPU: 1 PID: 81 Comm: kworker/u16:4 Not tainted 6.7.0-g0dd3ee311255 #1 f5757d475795053fd2ad52247a070cd50dd046f2
Hardware name: Google Lazor (rev1 - 2) with LTE (DT)
Workqueue: events_unbound deferred_probe_work_func
pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : clk_branch_toggle+0x114/0x168
lr : clk_branch_toggle+0x110/0x168
sp : ffffffc08084b670
pmr_save: 00000060
x29: ffffffc08084b680 x28: ffffff808006de00 x27: 0000000000000001
x26: ffffff8080dbd4f4 x25: 0000000000000000 x24: 0000000000000000
x23: 0000000000000000 x22: ffffffd838461198 x21: ffffffd838007997
x20: ffffffd837541d5c x19: 0000000000000001 x18: 0000000000000004
x17: 0000000000000000 x16: 0000000000000010 x15: ffffffd837070fac
x14: 0000000000000003 x13: 0000000000000004 x12: 0000000000000001
x11: c0000000ffffdfff x10: ffffffd838347aa0 x9 : 08dadf92e516c000
x8 : 08dadf92e516c000 x7 : 0000000000000000 x6 : 0000000000000027
x5 : ffffffd8385a61f2 x4 : 0000000000000000 x3 : ffffffc08084b398
x2 : ffffffc08084b3a0 x1 : 00000000ffffdfff x0 : 00000000fffffff0
Call trace:
clk_branch_toggle+0x114/0x168
clk_branch2_enable+0x24/0x30
clk_core_enable+0x5c/0x1c8
clk_enable+0x38/0x58
clk_bulk_enable+0x40/0xb0
mdss_runtime_resume+0x68/0x258
pm_generic_runtime_resume+0x30/0x44
__genpd_runtime_resume+0x30/0x80
genpd_runtime_resume+0x124/0x214
__rpm_callback+0x7c/0x15c
rpm_callback+0x30/0x88
rpm_resume+0x390/0x4d8
rpm_resume+0x43c/0x4d8
__pm_runtime_resume+0x54/0x98
__device_attach+0xe0/0x170
device_initial_probe+0x1c/0x28
bus_probe_device+0x48/0xa4
device_add+0x52c/0x6fc
mipi_dsi_device_register_full+0x104/0x1a8
devm_mipi_dsi_device_register_full+0x28/0x78
ti_sn_bridge_probe+0x1dc/0x2bc
auxiliary_bus_probe+0x4c/0x94
really_probe+0xf8/0x270
__driver_probe_device+0xa8/0x130
driver_probe_device+0x44/0x104
__device_attach_driver+0xa4/0xcc
bus_for_each_drv+0x94/0xe8
__device_attach+0xf8/0x170
device_initial_probe+0x1c/0x28
bus_probe_device+0x48/0xa4
deferred_probe_work_func+0x9c/0xd8
process_scheduled_works+0x1ac/0x4dc
worker_thread+0x110/0x320
kthread+0x100/0x120
ret_from_fork+0x10/0x20

Fixes: 703db1f5da1e ("clk: qcom: rcg2: Cache CFG register updates for parked RCGs")
Reported-by: Stephen Boyd <sboyd@kernel.org>
Closes:

https://lore.kernel.org/r/1290a5a0f7f584fcce722eeb2a1fd898.sboyd@kernel.org
Closes:

https://issuetracker.google.com/319956935
Reported-by: Laura Nao <laura.nao@collabora.com>
Closes:

https://patchwork.kernel.org/patch/13607408/)
(also found at

https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/5403223
Bot-Commit: Rubber Stamper <rubber-stamper@appspot.gserviceaccount.com>

M drivers/clk/qcom/clk-rcg2.c

https://chromium-review.googlesource.com/5403223

ap...@google.com <ap...@google.com> #21Mar 29, 2024 06:24PM

Project: chromiumos/third_party/kernel
Branch: release-R124-15823.B-chromeos-6.6

commit e96f6a88b5e0635135aa645996e00f742e5a1570
Author: Stephen Boyd <swboyd@chromium.org>
Date: Wed Mar 27 13:27:37 2024

FROMLIST: clk: qcom: dispcc-sc7180: Force off rotator clk at probe

The 'disp_cc_mdss_rot_clk' is parented to 'disp_cc_pll0' and enabled out
of boot on sc7180 Trogdor devices. This is a problem because the mdss
driver (the driver for compatible node "qcom,sc7180-mdss") turns off
'disp_cc_mdss_mdp_clk' and that clk is also parented to 'disp_cc_pll0'.
When the display pll turns off, the rotator clk gets stuck on.

We don't really care about this clk being on through boot, so simply
disable the clk during driver probe to avoid this issue.

Cc: Bjorn Andersson <andersson@kernel.org>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Taniya Das <quic_tdas@quicinc.com>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
(am from

https://patchwork.kernel.org/patch/13607407/)
(also found at

https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/5403222
Bot-Commit: Rubber Stamper <rubber-stamper@appspot.gserviceaccount.com>

M drivers/clk/qcom/dispcc-sc7180.c

https://chromium-review.googlesource.com/5403222

sw...@google.com <sw...@google.com> #22May 2, 2024 10:49PM

Sent an alternate approach at https://lore.kernel.org/r/20240502224703.103150-1-swboyd@chromium.org which is hopefully more acceptable.

sw...@google.com <sw...@google.com> #23May 14, 2024 10:14PM

Assigned to bu...@google.com.

It's merge window now so this will have to wait until that closes. I hope that Bjorn picks it up shortly after the window closes and lets it soak in linux-next for a while.

bugjuggler: wait 2w

bu...@google.com <bu...@google.com> #24May 14, 2024 10:14PM

Accepted by bu...@google.com.

Hi. I've received your bug and will wait until 2024-05-28 15:14 -0700 PDT and then assign the bug to swboyd@google.com.

ap...@google.com <ap...@google.com> #25May 15, 2024 11:01PM

Project: chromiumos/third_party/kernel
Branch: chromeos-6.6

commit 9ac74b9ae59b8d62d886dad141fbd7f88a9c25e9
Author: Stephen Boyd <swboyd@chromium.org>
Date: Thu May 02 15:47:02 2024

FROMLIST: clk: qcom: Park shared RCGs upon registration

There's two problems with shared RCGs.

The first problem is that they incorrectly report the parent after
commit 703db1f5da1e ("clk: qcom: rcg2: Cache CFG register updates for
parked RCGs"). That's because the cached CFG register value needs to be
populated when the clk is registered. clk_rcg2_shared_enable() writes
the cached CFG register value 'parked_cfg'. This value is initially zero
due to static initializers. If a driver calls clk_enable() before
setting a rate or parent, it will set the parent to '0' which is
(almost?) always XO, and may not reflect the parent at registration. In
the worst case, this switches the RCG from sourcing a fast PLL to the
slow crystal speed.

The second problem is that the force enable bit isn't cleared. The force
enable bit is only used during parking and unparking of shared RCGs.
Otherwise it shouldn't be set because it keeps the RCG enabled even when
all the branches on the output of the RCG are disabled (the hardware has
a feedback mechanism so that any child branches keep the RCG enabled
when the branch enable bit is set). This problem wastes power if the clk
is unused, and is harmful in the case that the clk framework disables
the parent of the force enabled RCG. In the latter case, the GDSC the
shared RCG is associated with will get wedged if the RCG's source clk is
disabled and the GDSC tries to enable the RCG to do "housekeeping" while
powering on.

Both of these problems combined with incorrect runtime PM usage in the
display driver lead to a black screen on Qualcomm sc7180 Trogdor
chromebooks. What happens is that the bootloader leaves the
'disp_cc_mdss_rot_clk' enabled and the 'disp_cc_mdss_rot_clk_src' force
enabled and parented to 'disp_cc_pll0'. The mdss driver probes and
runtime suspends, disabling the mdss_gdsc which uses the
'disp_cc_mdss_rot_clk_src' for "housekeeping". The
'disp_cc_mdss_rot_clk' is disabled during late init because the clk is
unused, but the parent 'disp_cc_mdss_rot_clk_src' is still force enabled
because the force enable bit was never cleared. Then 'disp_cc_pll0' is
disabled because it is also unused. That's because the clk framework
believes the parent of the RCG is XO when it isn't. A child device of
the mdss device (e.g. DSI) runtime resumes mdss which powers on the
mdss_gdsc. This wedges the GDSC because 'disp_cc_mdss_rot_clk_src' is
parented to 'disp_cc_pll0' and that PLL is off. With the GDSC wedged,
mdss_runtime_resume() tries to enable 'disp_cc_mdss_mdp_clk' but it
can't because the GDSC has wedged all the clks associated with the GDSC
causing clks to stay stuck off.

This leads to the following warning seen at boot and a black screen
because the display driver fails to probe.

disp_cc_mdss_mdp_clk status stuck at 'off'
WARNING: CPU: 1 PID: 81 at drivers/clk/qcom/clk-branch.c:87 clk_branch_toggle+0x114/0x168
Modules linked in:
CPU: 1 PID: 81 Comm: kworker/u16:4 Not tainted 6.7.0-g0dd3ee311255 #1 f5757d475795053fd2ad52247a070cd50dd046f2
Hardware name: Google Lazor (rev1 - 2) with LTE (DT)
Workqueue: events_unbound deferred_probe_work_func
pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : clk_branch_toggle+0x114/0x168
lr : clk_branch_toggle+0x110/0x168
sp : ffffffc08084b670
pmr_save: 00000060
x29: ffffffc08084b680 x28: ffffff808006de00 x27: 0000000000000001
x26: ffffff8080dbd4f4 x25: 0000000000000000 x24: 0000000000000000
x23: 0000000000000000 x22: ffffffd838461198 x21: ffffffd838007997
x20: ffffffd837541d5c x19: 0000000000000001 x18: 0000000000000004
x17: 0000000000000000 x16: 0000000000000010 x15: ffffffd837070fac
x14: 0000000000000003 x13: 0000000000000004 x12: 0000000000000001
x11: c0000000ffffdfff x10: ffffffd838347aa0 x9 : 08dadf92e516c000
x8 : 08dadf92e516c000 x7 : 0000000000000000 x6 : 0000000000000027
x5 : ffffffd8385a61f2 x4 : 0000000000000000 x3 : ffffffc08084b398
x2 : ffffffc08084b3a0 x1 : 00000000ffffdfff x0 : 00000000fffffff0
Call trace:
clk_branch_toggle+0x114/0x168
clk_branch2_enable+0x24/0x30
clk_core_enable+0x5c/0x1c8
clk_enable+0x38/0x58
clk_bulk_enable+0x40/0xb0
mdss_runtime_resume+0x68/0x258
pm_generic_runtime_resume+0x30/0x44
__genpd_runtime_resume+0x30/0x80
genpd_runtime_resume+0x124/0x214
__rpm_callback+0x7c/0x15c
rpm_callback+0x30/0x88
rpm_resume+0x390/0x4d8
rpm_resume+0x43c/0x4d8
__pm_runtime_resume+0x54/0x98
__device_attach+0xe0/0x170
device_initial_probe+0x1c/0x28
bus_probe_device+0x48/0xa4
device_add+0x52c/0x6fc
mipi_dsi_device_register_full+0x104/0x1a8
devm_mipi_dsi_device_register_full+0x28/0x78
ti_sn_bridge_probe+0x1dc/0x2bc
auxiliary_bus_probe+0x4c/0x94
really_probe+0xf8/0x270
__driver_probe_device+0xa8/0x130
driver_probe_device+0x44/0x104
__device_attach_driver+0xa4/0xcc
bus_for_each_drv+0x94/0xe8
__device_attach+0xf8/0x170
device_initial_probe+0x1c/0x28
bus_probe_device+0x48/0xa4
deferred_probe_work_func+0x9c/0xd8

Fix these problems by parking shared RCGs at boot. This will properly
initialize the parked_cfg struct member so that the parent is reported
properly and ensure that the clk won't get stuck on or off because the
RCG is parented to the safe source (XO).

Fixes: 703db1f5da1e ("clk: qcom: rcg2: Cache CFG register updates for parked RCGs")
Reported-by: Stephen Boyd <sboyd@kernel.org>
Closes:

https://lore.kernel.org/r/1290a5a0f7f584fcce722eeb2a1fd898.sboyd@kernel.org
Closes:

https://issuetracker.google.com/319956935
Reported-by: Laura Nao <laura.nao@collabora.com>
Closes:

https://patchwork.kernel.org/patch/13652182/)
(also found at

https://lore.kernel.org/r/20240502224703.103150-1-swboyd@chromium.org)

BUG=b:319956935
TEST=emerge-trogdor chromeos-kernel-6_6; Boot on Lazor
UPSTREAM-TASK=b:331677760

Change-Id: Ia866b5ac5a1ce87b489ecc3ecf63f3fded97a3c8
Reviewed-on:

https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/5540453
Tested-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Sean Paul <sean@poorly.run>
Commit-Queue: Stephen Boyd <swboyd@chromium.org>

M drivers/clk/qcom/clk-rcg2.c

https://chromium-review.googlesource.com/5540453

ap...@google.com <ap...@google.com> #26May 15, 2024 11:01PM

Project: chromiumos/third_party/kernel
Branch: chromeos-6.6

commit 6761a826319bbcaec1d9044578ccb9cd91035bec
Author: Stephen Boyd <swboyd@chromium.org>
Date: Tue May 14 15:20:16 2024

Revert "FROMLIST: clk: qcom: dispcc-sc7180: Force off rotator clk at probe"

This reverts commit 9e8799af6f0d52d3495e44982dcf85123112aca0. We're
going to pick the alternative approach as FROMLIST.

BUG=b:331677760, b:319956935
TEST=emerge-trogdor chromeos-kernel-6_6; Boot on Lazor

Change-Id: I1e45683c8c1a2f0d89614e882dcd5f5ed06fd2c3
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-on:

https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/5540452
Reviewed-by: Douglas Anderson <dianders@chromium.org>

M drivers/clk/qcom/dispcc-sc7180.c

https://chromium-review.googlesource.com/5540452

ap...@google.com <ap...@google.com> #27May 15, 2024 11:01PM

Project: chromiumos/third_party/kernel
Branch: chromeos-6.6

commit 18d3b9ce2bb56c449002ae9c44a41916cc041a22
Author: Stephen Boyd <swboyd@chromium.org>
Date: Tue May 14 15:20:13 2024

Revert "FROMLIST: clk: qcom: Properly initialize shared RCGs upon registration"

This reverts commit 74d088b425d7b58e71c002e4480c2c0d96ce4e99. We're
going to pick the alternative approach as FROMLIST.

BUG=b:331677760, b:319956935
TEST=emerge-trogdor chromeos-kernel-6_6; Boot on Lazor

Cq-Depend: chromium:5540453
Change-Id: I8281d7b027e817cbf80e1593324ac3e816616f3e
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-on:

https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/5540451
Reviewed-by: Douglas Anderson <dianders@chromium.org>

M drivers/clk/qcom/clk-rcg2.c

https://chromium-review.googlesource.com/5540451

bu...@google.com <bu...@google.com> May 28, 2024 10:14PM

Assigned to sw...@google.com.

di...@google.com <di...@google.com> #28Jun 3, 2024 03:18PM

So what's the plan here? Is it a good time for the patches to land upstream now?

Also: do we keep this bug open even though we've also got a (non-public) upstream tracking bug at b/331677760 ?

sw...@google.com <sw...@google.com> #29Jun 3, 2024 05:17PM

what's the plan here?

I've asked Bjorn and Dmitry on IRC to test and review the patch on the list. I expect Bjorn to merge it through the qcom clk driver branch so it gets plenty of time to bake in linux-next.

do we keep this bug open

That's my plan. I don't mind having two bugs for basically the same thing.

di...@google.com <di...@google.com> #30Jun 3, 2024 08:07PM

Reassigned to bu...@google.com.

Sounds good. Let's schedule a nag in another 3 weeks then...

bugjuggler: wait 3 weeks

bu...@google.com <bu...@google.com> #31Jun 3, 2024 08:07PM

Accepted by bu...@google.com.

Hi. I've received your bug and will wait until 2024-06-24 13:07 -0700 PDT and then assign the bug to swboyd@google.com.

bu...@google.com <bu...@google.com> Jun 24, 2024 08:07PM

Assigned to sw...@google.com.

di...@google.com <di...@google.com> #32Jul 8, 2024 02:36PM

Reassigned to bu...@google.com.

We're at -rc6 now. This feels like the kind of thing that should show up earlier in the merge window? I guess that means snooze for another 4 weeks?

bugjuggler: wait 4 weeks

bu...@google.com <bu...@google.com> #33Jul 8, 2024 02:36PM

Accepted by bu...@google.com.

Hi. I've received your bug and will wait until 2024-08-05 07:36 -0700 PDT and then assign the bug to swboyd@google.com.

di...@google.com <di...@google.com> #34Jul 8, 2024 07:48PM

Assigned to sw...@google.com.

...or Stephen will just apply it. ;-)

https://lore.kernel.org/r/CAE-0n50VDgsg-4QnynvLOzykr3KP5JsnHqeFPA=uRT3EfgL19g@mail.gmail.com

I guess now we just decide if we want to revert the FROMLIST and reland as FROMGIT or not...

di...@google.com <di...@google.com> #35Aug 7, 2024 11:24PM

Marked as fixed.

Patch IDs match. I guess keeping as FROMLIST is fine. Closing.

# Downstream:
$ git show 01a0a6cc8cfd | git patch-id
2f52f110c3d40a4c25b8657724551ddf400be4b0 01a0a6cc8cfd9952e72677d48d56cf6bc4e3a561

# Upstream:
$ git show 9ac74b9ae59b8d62d886dad141fbd7f88a9c25e9 | git patch-id
2f52f110c3d40a4c25b8657724551ddf400be4b0 9ac74b9ae59b8d62d886dad141fbd7f88a9c25e9

Issue 319956935

Description

Issue summary

Comments

di...@google.com <di...@google.com> Jan 22, 2024 07:54PM

dm...@linaro.org <dm...@linaro.org> #2Jan 23, 2024 12:50AM

sw...@google.com <sw...@google.com> #3Jan 26, 2024 11:41PM

sw...@google.com <sw...@google.com> #4Feb 24, 2024 06:31PM

sw...@google.com <sw...@google.com> #5Mar 26, 2024 10:26PM

sw...@google.com <sw...@google.com> #6Mar 27, 2024 12:07AM

sw...@google.com <sw...@google.com> #7Mar 27, 2024 12:20AM

db...@gmail.com <db...@gmail.com> #8Mar 27, 2024 01:52AM

sw...@google.com <sw...@google.com> #9Mar 27, 2024 02:34AM

sw...@google.com <sw...@google.com> #10Mar 27, 2024 02:39AM

sw...@google.com <sw...@google.com> #11Mar 27, 2024 03:07AM

sw...@google.com <sw...@google.com> #12Mar 27, 2024 03:40AM

sw...@google.com <sw...@google.com> #13Mar 27, 2024 07:00AM

sw...@google.com <sw...@google.com> #14Mar 27, 2024 07:17AM

sw...@google.com <sw...@google.com> #15Mar 27, 2024 08:27PM

sw...@google.com <sw...@google.com> #16Mar 27, 2024 08:28PM

ro...@google.com <ro...@google.com> #17Mar 27, 2024 08:58PM

ap...@google.com <ap...@google.com> #18Mar 28, 2024 04:52PM

ap...@google.com <ap...@google.com> #19Mar 28, 2024 04:52PM

ap...@google.com <ap...@google.com> #20Mar 29, 2024 06:24PM

ap...@google.com <ap...@google.com> #21Mar 29, 2024 06:24PM

sw...@google.com <sw...@google.com> #22May 2, 2024 10:49PM

sw...@google.com <sw...@google.com> #23May 14, 2024 10:14PM

bu...@google.com <bu...@google.com> #24May 14, 2024 10:14PM

ap...@google.com <ap...@google.com> #25May 15, 2024 11:01PM

ap...@google.com <ap...@google.com> #26May 15, 2024 11:01PM

ap...@google.com <ap...@google.com> #27May 15, 2024 11:01PM

bu...@google.com <bu...@google.com> May 28, 2024 10:14PM

di...@google.com <di...@google.com> #28Jun 3, 2024 03:18PM

sw...@google.com <sw...@google.com> #29Jun 3, 2024 05:17PM

di...@google.com <di...@google.com> #30Jun 3, 2024 08:07PM

bu...@google.com <bu...@google.com> #31Jun 3, 2024 08:07PM

bu...@google.com <bu...@google.com> Jun 24, 2024 08:07PM

di...@google.com <di...@google.com> #32Jul 8, 2024 02:36PM

bu...@google.com <bu...@google.com> #33Jul 8, 2024 02:36PM

di...@google.com <di...@google.com> #34Jul 8, 2024 07:48PM

di...@google.com <di...@google.com> #35Aug 7, 2024 11:24PM

Add comment

Issue metadata