Fixed
Status Update
Comments
la...@google.com <la...@google.com> #2
Marking P1 because this is blocking crbug.com/1095661
va...@google.com <va...@google.com> #3
va...@google.com <va...@google.com> #4
when this bug was filed:
* the build was at R86-13320.0.0-34621
* we were running icedtea-bin-3.4.0 for Java
* we were running bazel-0.24.1
here is a recent build failing:
https://ci.chromium.org/p/chromeos/builders/cq/reef-cq/b8872132023285434624 ?
* it has upgraded java (openjdk-bin-11.0.7) which is both significantly newer and a completely different code base
* it is running bazel-3.2.0
that seems to rule out Java itself as the problem. i don't know if we can do a log search to try and narrow down when these builds first started timing out. maybe we can look for logs around/before June 16th for builds that (1) timed out and (2) *don't* have the log line "Completed.*chromeos-base/drivefs". not sure if that is something we can search for ...
* the build was at R86-13320.0.0-34621
* we were running icedtea-bin-3.4.0 for Java
* we were running bazel-0.24.1
here is a recent build failing:
* it has upgraded java (openjdk-bin-11.0.7) which is both significantly newer and a completely different code base
* it is running bazel-3.2.0
that seems to rule out Java itself as the problem. i don't know if we can do a log search to try and narrow down when these builds first started timing out. maybe we can look for logs around/before June 16th for builds that (1) timed out and (2) *don't* have the log line "Completed.*chromeos-base/drivefs". not sure if that is something we can search for ...
mm...@google.com <mm...@google.com> #5
I'm looking at logs and queries, unfortunately a bug in the build api logs not getting stored on step failures
(https://bugs.chromium.org/p/chromium/issues/detail?id=1096784 ) limits the ability to query as easily.
(
sk...@google.com <sk...@google.com> #6
mm...@google.com <mm...@google.com> #7
To compare how logs have changed a bit over time,
here are some older runs that exhibited this problem (from Aug 6...the problem was happening before then too as shown by the first bug entry):
~/CR_OS/chromiumos/chromite$ ./contrib/query_build_logs -s "|-{java}," -w last90days --asc -l 5
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| Stdout0 | BuildId | step_name | DateTime |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| |-{java},14075 |https://ci.chromium.org/b/8872688175013250672 | install packages|call chromite.api.SysrootService/InstallPackages|call build API script | 2020-08-06 20:21:02 |
| |-{java},14078 | | | |
| | | | |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| |-{java},16610 |https://ci.chromium.org/b/8872687565517622080 | install packages|call chromite.api.SysrootService/InstallPackages|call build API script | 2020-08-06 20:37:55 |
| |-{java},16612 | | | |
| | | | |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| |-{java},17865 |https://ci.chromium.org/b/8872686473790083824 | install packages|call chromite.api.SysrootService/InstallPackages|call build API script | 2020-08-06 20:44:46 |
| |-{java},17866 | | | |
| | | | |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| |-{java},15534 |https://ci.chromium.org/b/8872672040562132640 | install packages|call chromite.api.SysrootService/InstallPackages|call build API script | 2020-08-07 00:45:31 |
| |-{java},15539 | | | |
| | | | |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| |-{java},25516 |https://ci.chromium.org/b/8872645477857588784 | install packages|call chromite.api.SysrootService/InstallPackages|call build API script | 2020-08-07 07:39:22 |
| |-{java},25528 | | | |
| | | | |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
5 rows in result set (59.65 sec)
And here are 5 recent ones:
./contrib/query_build_logs -s "|-{java}," -w last90days -l 5
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| Stdout0 | BuildId | step_name | DateTime |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| |-{java},15053 |https://ci.chromium.org/b/8872084226000208080 | install packages|call chromite.api.SysrootService/InstallPackages|call build API script | 2020-08-13 12:27:35 |
| |-{java},15056 | | | |
| | | | |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| |-{java},11600 |https://ci.chromium.org/b/8872083893366072784 | install packages|call chromite.api.SysrootService/InstallPackages|call build API script | 2020-08-13 12:23:38 |
| |-{java},11602 | | | |
| | | | |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| |-{java},21233 |https://ci.chromium.org/b/8872102534838247056 | install packages|call chromite.api.SysrootService/InstallPackages|call build API script | 2020-08-13 09:34:36 |
| |-{java},21240 | | | |
| | | | |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| |-{java},26191 |https://ci.chromium.org/b/8872120931947973072 | install packages|call chromite.api.SysrootService/InstallPackages|call build API script | 2020-08-13 02:43:27 |
| |-{java},26196 | | | |
| | | | |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| |-{java},18698 |https://ci.chromium.org/b/8872124395785659504 | install packages|call chromite.api.SysrootService/InstallPackages|call build API script | 2020-08-13 01:34:18 |
| |-{java},18708 | | | |
| | | | |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
5 rows in result set (59.66 sec)
Scan rate: 0.25M rows/sec, SWE cost: 0.290206s
here are some older runs that exhibited this problem (from Aug 6...the problem was happening before then too as shown by the first bug entry):
~/CR_OS/chromiumos/chromite$ ./contrib/query_build_logs -s "|-{java}," -w last90days --asc -l 5
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| Stdout0 | BuildId | step_name | DateTime |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| |-{java},14075 |
| |-{java},14078 | | | |
| | | | |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| |-{java},16610 |
| |-{java},16612 | | | |
| | | | |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| |-{java},17865 |
| |-{java},17866 | | | |
| | | | |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| |-{java},15534 |
| |-{java},15539 | | | |
| | | | |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| |-{java},25516 |
| |-{java},25528 | | | |
| | | | |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
5 rows in result set (59.65 sec)
And here are 5 recent ones:
./contrib/query_build_logs -s "|-{java}," -w last90days -l 5
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| Stdout0 | BuildId | step_name | DateTime |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| |-{java},15053 |
| |-{java},15056 | | | |
| | | | |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| |-{java},11600 |
| |-{java},11602 | | | |
| | | | |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| |-{java},21233 |
| |-{java},21240 | | | |
| | | | |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| |-{java},26191 |
| |-{java},26196 | | | |
| | | | |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
| |-{java},18698 |
| |-{java},18708 | | | |
| | | | |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+---------------------+
5 rows in result set (59.66 sec)
Scan rate: 0.25M rows/sec, SWE cost: 0.290206s
va...@google.com <va...@google.com> #8
we only have 2 packages using bazel atm: tensorflow & drivefs. only drivefs is hanging in this way. not sure if tensorflow is built for every device like drivefs is.
sb...@google.com <sb...@google.com> #9
One thought is that Bazel is forking off its server, which sticks around? That would be easiest to control via the `--max_idle_secs` flag.
https://g3doc.corp.google.com/devtools/blaze/g3doc/guide.md?cl=head#client/server
sb...@google.com <sb...@google.com> #11
Re. #8: Or the "--batch" flag, ditto.
I see bazel.eclass doing this (the bazelrc it creates contains a "startup --batch" line), and cros_bazel.eclass also does that, so assuming they use those, that's not the culprit (maybe confirm with a log, to see exactly what options were passed to Bazel?)
I see bazel.eclass doing this (the bazelrc it creates contains a "startup --batch" line), and cros_bazel.eclass also does that, so assuming they use those, that's not the culprit (maybe confirm with a log, to see exactly what options were passed to Bazel?)
va...@google.com <va...@google.com> #12
the bazel eclass uses --spawn_strategy=standalone (aka "local"):
https://chromium.googlesource.com/chromiumos/overlays/portage-stable/+/12b28ea8c5da956be348109ae945a856b5a13f12/eclass/bazel.eclass#106
https://docs.bazel.build/versions/master/user-manual.html#flag--spawn_strategy
the drivefs ebuild does:
src_compile() {
ebazel build //${DRIVEFS_DIR}/chromeos:drivefs
ebazel shutdown
}
the log suggests we don't make it to the shutdown call. we can try adding --max_idle_secs to see if it helps.
the drivefs ebuild does:
src_compile() {
ebazel build //${DRIVEFS_DIR}/chromeos:drivefs
ebazel shutdown
}
the log suggests we don't make it to the shutdown call. we can try adding --max_idle_secs to see if it helps.
va...@google.com <va...@google.com> #13
i poked it a bit, and we're using the --batch option:
https://chromium.googlesource.com/chromiumos/overlays/portage-stable/+/12b28ea8c5da956be348109ae945a856b5a13f12/eclass/bazel.eclass#94
https://docs.bazel.build/versions/master/user-manual.html#flag--batch
this disables the client/server model entirely, so that should be handled already.
this disables the client/server model entirely, so that should be handled already.
va...@google.com <va...@google.com> #16
we don't need links to more logs at this point. we know it's on-going, and Michael has a script to help us locate more if needed (see comment #6 ).
va...@google.com <va...@google.com> #17
i'll try disabling the Gentoo sandbox to see if that is happening to mess things up. i hate doing this, but bazel has its own sandbox which should provide its own guarantees for protecting the rest of the system.
sa...@google.com <sa...@google.com> #18
cj...@google.com <cj...@google.com> #19
Dropping the priority on this because it does not seem to be happening as often.
cj...@google.com <cj...@google.com> #20
I feel very lucky: In a completely unrelated discussion I was made aware of a bug that I think is the root cause of this issue.
I believe the cause of this hang is a bug in glibc!https://sourceware.org/bugzilla/show_bug.cgi?id=25847
The summary of that bug is that it is a *confirmed issue* that under certain conditions threads waiting on a condvar will deadlock and never wake up.
I believe the cause of this hang is a bug in glibc!
The summary of that bug is that it is a *confirmed issue* that under certain conditions threads waiting on a condvar will deadlock and never wake up.
ma...@google.com <ma...@google.com> #21
Thanks Chris,
Marking this as external dependency and setting up a NextAction date for Nov. I'll consult with internal glibc experts meanwhile.
Marking this as external dependency and setting up a NextAction date for Nov. I'll consult with internal glibc experts meanwhile.
sc...@google.com <sc...@google.com> #22
The NextAction date has arrived: 2020-11-02
mm...@google.com <mm...@google.com> #23
According to https://sourceware.org/bugzilla/show_bug.cgi?id=25847 ,
it looks like the bug has been fixed -- I will follow up in our next Build team meeting
to see if we need to take any action to have fixed version of Bazel.
I'm fairly confident that Qin Li's patch stops the deadlock from happening (see my previous comment). I understand it may not be the maximally efficient fix but I consider not deadlocking to outweigh any possible inefficiency introduced only in this rather edge, lock stealing case.
Given how hard this bug is to isolate (it took about two man-months on my side), and that other orgs have gone through and are going through similar experiences, it seems right to apply the fix now and discuss improving the performance of the fix in a new bug.
From Malte's excellent blog post it appears people have been struggling with this since at least 2019https://github.com/mratsim/weave/issues/56
it looks like the bug has been fixed -- I will follow up in our next Build team meeting
to see if we need to take any action to have fixed version of Bazel.
I'm fairly confident that Qin Li's patch stops the deadlock from happening (see my previous comment). I understand it may not be the maximally efficient fix but I consider not deadlocking to outweigh any possible inefficiency introduced only in this rather edge, lock stealing case.
Given how hard this bug is to isolate (it took about two man-months on my side), and that other orgs have gone through and are going through similar experiences, it seems right to apply the fix now and discuss improving the performance of the fix in a new bug.
From Malte's excellent blog post it appears people have been struggling with this since at least 2019
va...@google.com <va...@google.com> #25
it would be nice if we could merge something on our side ahead of time. if we're very shy of breaking devices, then it could at least be put behind `use cros_host` so our builders stop choking.
ma...@google.com <ma...@google.com> #26
Currently the patch seems to be incomplete or needs changes as per upstream discussion: https://sourceware.org/pipermail/libc-alpha/2020-December/121106.html . Let's wait and see if there is something newer this month.
sc...@google.com <sc...@google.com> #27
The NextAction date has arrived: 2021-02-01
va...@google.com <va...@google.com> #28
ma...@google.com <ma...@google.com> #29
Assigning to Adrian(collabora),
I do not see any commits inhttps://sourceware.org/bugzilla/show_bug.cgi?id=25847 , So I am not sure if there is an upstream fix for it yet.
If there is no official fix yet, can you please cherrypick the ubuntu patch :https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1899800/comments/1 ? There is a reproducer in the ubuntu bug as well.
I do not see any commits in
If there is no official fix yet, can you please cherrypick the ubuntu patch :
ad...@collabora.corp-partner.google.com <ad...@collabora.corp-partner.google.com> #30
@Manoj & all
The Ubuntu patch is a safer mitigation while the upstream glibc bug report contains two proposed fixes (one in Nov 2020, another in May 2021) which are quite complex.
I built and run the glibc reproducer pointed to from Launchpad (https://sourceware.org/bugzilla/attachment.cgi?id=12480 ) and apparently it works against our 2.32 tree.
My recommendation is to use the safer Ubuntu mitigation until a proper fix lands in glibc upstream AND is tested in a released glibc for a while.
Until then we carry the simple one- liner mitigation as a local patch in our glibc ebuild.
Will post a CL.
The Ubuntu patch is a safer mitigation while the upstream glibc bug report contains two proposed fixes (one in Nov 2020, another in May 2021) which are quite complex.
I built and run the glibc reproducer pointed to from Launchpad (
My recommendation is to use the safer Ubuntu mitigation until a proper fix lands in glibc upstream AND is tested in a released glibc for a while.
Until then we carry the simple one- liner mitigation as a local patch in our glibc ebuild.
Will post a CL.
ap...@google.com <ap...@google.com> #31
Project: chromiumos/overlays/chromiumos-overlay
Branch: main
commit 1090d13529c2fe2815cbdbb3831151879836c553
Author: Adrian Ratiu <adrian.ratiu@collabora.corp-partner.google.com>
Date: Fri Jul 09 14:31:58 2021
sys-libs/glibc: add mitigation for pthread_cond_signal deadlock
This mitigation is applied until the bug reported at [1]
gets a proper fix in glibc upstream. Work is on-going and
the proposed fixes are more complex than this mitigation,
so for now use it to avoid the deadlock because it safest.
The mitigation is at location [2] in the above bug report
and was also tested and deployed in Ubuntu starting with
Bionic onward [3].
[1]https://sourceware.org/bugzilla/show_bug.cgi?id=25847
[2]https://sourceware.org/bugzilla/attachment.cgi?id=12484
[3]https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1899800
BUG=b:187793042
TEST=Local builds; above reproducer test.
Change-Id: I0f8e6b8606de00e023e1f5b2ec0124882cbb726e
Reviewed-on:https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/3016193
Reviewed-by: Mike Frysinger <vapier@chromium.org>
Reviewed-by: Manoj Gupta <manojgupta@chromium.org>
Tested-by: Luis Lozano <llozano@chromium.org>
Commit-Queue: Manoj Gupta <manojgupta@chromium.org>
A sys-libs/glibc/files/local/glibc-2.32/0012-nptl-mitigate-failed-pthread_cond_signal-wakeup.patch
M sys-libs/glibc/glibc-2.32-r15.ebuild
https://chromium-review.googlesource.com/3016193
Branch: main
commit 1090d13529c2fe2815cbdbb3831151879836c553
Author: Adrian Ratiu <adrian.ratiu@collabora.corp-partner.google.com>
Date: Fri Jul 09 14:31:58 2021
sys-libs/glibc: add mitigation for pthread_cond_signal deadlock
This mitigation is applied until the bug reported at [1]
gets a proper fix in glibc upstream. Work is on-going and
the proposed fixes are more complex than this mitigation,
so for now use it to avoid the deadlock because it safest.
The mitigation is at location [2] in the above bug report
and was also tested and deployed in Ubuntu starting with
Bionic onward [3].
[1]
[2]
[3]
BUG=b:187793042
TEST=Local builds; above reproducer test.
Change-Id: I0f8e6b8606de00e023e1f5b2ec0124882cbb726e
Reviewed-on:
Reviewed-by: Mike Frysinger <vapier@chromium.org>
Reviewed-by: Manoj Gupta <manojgupta@chromium.org>
Tested-by: Luis Lozano <llozano@chromium.org>
Commit-Queue: Manoj Gupta <manojgupta@chromium.org>
A sys-libs/glibc/files/local/glibc-2.32/0012-nptl-mitigate-failed-pthread_cond_signal-wakeup.patch
M sys-libs/glibc/glibc-2.32-r15.ebuild
Description
looks like bazel is hanging for some reason
# pstree -p -s -a 4559 -A
systemd,1
`-python,2067 /b/s/swarming_bot.2.zip start_bot
`-python,4451 /b/s/swarming_bot.2.zip task_runner --swarming-server ...
`-python,4461 -u /b/s/swarming_bot.2.zip run_isolated -a /b/s/w/run_isolated_args.json
`-kitchen,4524 cook -buildbucket-hostname
`-bash,4559 /b/s/w/ir/kitchen-checkout/recipes run --properties-file /b/s/w/ir/tmp/rr/properties.json ...
`-python,4560 -u -s /b/s/w/ir/kitchen-checkout/recipe_engine/recipe_engine/main.py --package/b/s/w/ir/k
`-tee_wrapper.sh,5108 /b/s/w/ir/kitchen-checkout/chromeos/recipe_scripts/tee_wrapper.sh...
|-python3,5109 /b/s/w/ir/k/chromiumos_workspace/chromite/bin/build_api --input-json...
| `-sudo,5112 USE=chrome_internal CROS_CACHEDIR=/b/s/w/ir/k/chromiumos_workspace/.cache...
| `-python3,5114 /b/s/w/ir/k/chromiumos_workspace/chromite/bin/cros_sdk --chroot ...
| `-python3,5120 /b/s/w/ir/k/chromiumos_workspace/chromite/bin/cros_sdk --chroot ...
| `-python3,5123 /b/s/w/ir/k/chromiumos_workspace/chromite/bin/cros_sdk --chroot ...
| `-python3,5127 /mnt/host/source/chromite/bin/build_api ...
| `-build_packages,5409 /mnt/host/source/src/scripts/build_packages --board ...
| `-build_packages,8700 /mnt/host/source/src/scripts/build_packages ...
| `-emerge,8701 -b /usr/lib/python-exec/python3.6/emerge -uDNv ...
| `-sandbox,31864 /usr/lib/portage/python3.6/ebuild.sh compile
| `-ebuild.sh,31914 /usr/lib/portage/python3.6/ebuild.sh ...
| `-ebuild.sh,32171 ...
| `-java,1086 -XX:+HeapDumpOnOutOfMemoryError...
| |-{java},6465
| |-{java},6514
| |-{java},6531
| |-{java},6543
| |-{java},6558
| |-{java},6569
| |-{java},6578
| |-{java},6592
| |-{java},6610
| |-{java},6628
| |-{java},6637
| |-{java},6641
| |-{java},6646
| |-{java},6657
| |-{java},6672
| |-{java},6681
| |-{java},6693
| |-{java},6702
| |-{java},6712
| |-{java},6727
| |-{java},6735
| |-{java},6749
| |-{java},6775
| |-{java},6782
| |-{java},6883
| |-{java},6916
| |-{java},6930
| |-{java},7286
| |-{java},7306
| |-{java},7319
| |-{java},7332
| |-{java},7334
| |-{java},7340
| |-{java},7383
| |-{java},7396
| |-{java},7406
| |-{java},7420
| |-{java},7438
| |-{java},7448
| |-{java},7460
| |-{java},7469
| |-{java},7477
| |-{java},7500
| |-{java},7536
| |-{java},7556
| |-{java},14501
| |-{java},24841
| |-{java},24853
| |-{java},24854
| |-{java},24857
| |-{java},24859
| |-{java},24860
| |-{java},24861
| |-{java},24862
| |-{java},24863
| |-{java},24871
| |-{java},24872
| |-{java},24873
| |-{java},24875
| |-{java},24876
| |-{java},24878
| |-{java},24881
| |-{java},24893
| |-{java},24894
| |-{java},24895
| |-{java},24896
| |-{java},24897
| |-{java},24899
| |-{java},24900
| |-{java},24901
| |-{java},24902
| |-{java},24904
| |-{java},24905
| |-{java},24908
| |-{java},24909
| |-{java},24912
| |-{java},24913
| |-{java},24915
| `-{java},24921
`-tee,5110 /b/s/w/ir/k/recipe_cleanup/build_api_messages1_Zf4z/build_log.txt
# strace -p 1086
strace: Process 1086 attached
futex(0x7fd68bd229d0, FUTEX_WAIT, 19678, NULL
<it's hung here>
using gdb on the process:
(gdb) bt
#0 0x00007fd68b1ce8dd in __pthread_timedjoin_ex () from /b/s/w/ir/cache/cros_chroot/lib64/libpthread.so.0
#1 0x00007fd68b9a4fa5 in ?? () from /b/s/w/ir/cache/cros_chroot/opt/icedtea-bin-3.4.0/bin/../lib/amd64/jli/libjli.so
#2 0x00007fd68b9a15b4 in ?? () from /b/s/w/ir/cache/cros_chroot/opt/icedtea-bin-3.4.0/bin/../lib/amd64/jli/libjli.so
#3 0x00007fd68b9a2050 in JLI_Launch () from /b/s/w/ir/cache/cros_chroot/opt/icedtea-bin-3.4.0/bin/../lib/amd64/jli/libjli.so
#4 0x00000000004005b1 in ?? () /b/s/w/ir/cache/cros_chroot/opt/icedtea-bin-3.4.0/bin/java
#5 0x00007fd68b609a8e in __libc_start_main () from /b/s/w/ir/cache/cros_chroot/lib64/libc.so.6
#6 0x00000000004005e9 in ?? ()