Verified
Status Update
Comments
cc...@google.com <cc...@google.com> #2
I wasn't immediately able to reproduce any of this instability locally. I tried running the test in a loop 20 times on different devices, and all 4 benchmarks inside are very stable.
One thing I noticed about ScrollBenchmark is that the noisiness is more extreme the earlier the test appears in the class: https://screenshot.googleplex.com/0G7pkidU21G
The first two tests are by far the noisiest, while the last (#4) is by far the most stable.
One of the noisiest tests (offset) is the most trivial, and thus the least likely to be unstable on its own. The last, inflateBindOffset, is actually the most complex one. This makes me suspect we're seeing interference from other tests that run before it.
The two that come just before it are compose and navigation: https://screenshot.googleplex.com/4uK34BW6nOB I've also verified this order is stable across builds. From the host_logs in sponge, the ordering is:
- first, sort by small/medium/large
- then, sort by package name.
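The two-key ordering described above (size bucket first, then package name) can be sketched as a comparator. This is purely illustrative: the `TestClass` model and the package names below are assumptions for the example, not the actual CI implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TestOrdering {
    // Hypothetical model of how CI sees a test class: a size bucket (from
    // @SmallTest/@MediumTest/@LargeTest) plus its package name.
    static class TestClass {
        final String size;
        final String pkg;
        TestClass(String size, String pkg) { this.size = size; this.pkg = pkg; }
    }

    // Rank size buckets small < medium < large.
    static int sizeRank(String size) {
        return List.of("small", "medium", "large").indexOf(size);
    }

    public static void main(String[] args) {
        List<TestClass> tests = new ArrayList<>(List.of(
                new TestClass("medium", "androidx.navigation.benchmark"),
                new TestClass("small", "androidx.recyclerview.benchmark"),
                new TestClass("small", "androidx.compose.benchmark")));
        // First sort by size bucket, then alphabetically by package name.
        tests.sort(Comparator
                .comparingInt((TestClass t) -> sizeRank(t.size))
                .thenComparing(t -> t.pkg));
        for (TestClass t : tests) {
            System.out.println(t.size + " " + t.pkg);
        }
    }
}
```

Under this sort, a @SmallTest-labeled class always runs before every @MediumTest class, regardless of package, which is exactly why relabeling a class changes its position in the run.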
To test this theory, I have a CL to move ScrollBenchmark to @MediumTest (aosp/1132329), but the implications may be large. If this is an issue of cross-test pollution, this could be a major source of instability in all our benchmarks. +Siyamed and +Filip, who've been looking at unstable Compose benchmarks lately.
ap...@google.com <ap...@google.com> #3
Project: platform/frameworks/support
Branch: androidx-master-dev
commit ebbae577a3dbd7f923a38a479ef0f513e1dd64c6
Author: Chris Craik <ccraik@google.com>
Date: Fri Oct 04 10:17:47 2019
Label ScrollBenchmark @SmallTest to reorder in CI
Test: none
Bug: 140773023
This will allow us to check if instability in CI is due to other
benchmarks that come before it.
Change-Id: Ic27950c595ed2a5aea2c4bbf5a5afb2ca4b3103a
M recyclerview/recyclerview-benchmark/src/androidTest/java/androidx/recyclerview/benchmark/ScrollBenchmark.kt
https://android-review.googlesource.com/1132329
https://goto.google.com/android-sha1/ebbae577a3dbd7f923a38a479ef0f513e1dd64c6
cc...@google.com <cc...@google.com> #4
The CL correctly reordered the tests, but it didn't seem to impact performance stability, so it's almost certainly not the fault of other tests polluting device state or leaving work behind.
I'll keep trying to reproduce. Maybe it's something to do with running the first time after install, or is more device specific.
cc...@google.com <cc...@google.com> #5
I've done a lot more digging on this and retitled the bug to clarify the pattern. For modules where the first few benchmarks run quickly (each finishing in < 500ms), we see a lot of noise. This is true both for recyclerview:recyclerview-benchmark and for benchmark:benchmark-benchmark (the trivial and proof-of-concept benchmarks owned by the benchmark library itself).
The core problem is that we're not actually letting JIT finish during warmup. In recent versions of ART, the JIT thread's kernel priority is 129, while the default thread priority is 120 (these numbers are 'niceness'-based, so lower means higher priority).
Our single-threaded benchmarks run at default priority and fully saturate one of the two enabled big cores (the other cores are disabled). If any other default-priority background work runs at the same time, the JIT thread can be starved and make no significant forward progress on jitting the benchmark itself. This is especially bad if JIT happens to be working on benchmark-irrelevant code (e.g. JUnit or other code outside the loop). Warmup finishes quickly because performance looks stable, even though the code hasn't been jitted yet.
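For reference, on Linux the kernel priority numbers above map to Unix nice values as nice = kernelPrio - 120 for normal (SCHED_OTHER) threads, so priority 129 corresponds to nice 9. A quick sanity check of that arithmetic:

```java
public class Niceness {
    // Linux kernel priority for normal (SCHED_OTHER) threads spans 100..139,
    // mapping to nice values -20..19 via: nice = kernelPrio - 120.
    static int niceOf(int kernelPrio) {
        return kernelPrio - 120;
    }

    public static void main(String[] args) {
        // ART JIT thread at kernel priority 129 -> nice 9 (lower priority
        // than default, so it loses CPU to default-priority work).
        System.out.println(niceOf(129)); // 9
        // Default thread priority 120 -> nice 0.
        System.out.println(niceOf(120)); // 0
    }
}
```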
I'm working to land a fix that bumps the thread priority of both the benchmark and the JIT thread (though not raising JIT as high as the benchmark). I'll keep an eye on CI stability as this lands, and will share the broader stability impact of the fix.
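On Android the actual fix uses `android.os.Process.setThreadPriority`, which adjusts Linux niceness for the calling thread. As a rough plain-JVM analogue of the same idea (raise the measuring thread so background work is less likely to starve it), hedged as a sketch since plain Java only exposes the coarser 1..10 JVM priority scale:

```java
public class PriorityBump {
    public static void main(String[] args) throws InterruptedException {
        // Thread that would run the timed benchmark loop.
        Thread bench = new Thread(() -> {
            // ... warmup + timed measurement would run here ...
        });
        // Raise the benchmark thread above the default (5), analogous to
        // calling android.os.Process.setThreadPriority with a
        // higher-than-default THREAD_PRIORITY_* constant on Android.
        bench.setPriority(Thread.MAX_PRIORITY); // 10
        System.out.println(bench.getPriority());
        bench.start();
        bench.join();
    }
}
```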
cc...@google.com <cc...@google.com> #6
The screenshot above shows synchronizedIncrementBenchmark finishing warmup *before* JIT of the relevant benchmark code is done. Result timings started at ~1000ns and dropped to ~200ns partway through the result set.
ca...@google.com <ca...@google.com> #7
+ngeoffray as an FYI
ap...@google.com <ap...@google.com> #8
Project: platform/frameworks/support
Branch: androidx-master-dev
commit 345f86d34f7e373f464c0d8185e392b067a2de4a
Author: Chris Craik <ccraik@google.com>
Date: Wed Oct 09 17:37:17 2019
Bump thread priority of benchmarks and JIT during benchmarks
The JIT thread is so low priority that other parallel tasks can starve
it, especially for the first few benchmarks when a process runs.
The system can spin up significant background work right after install
and/or instrumentation start, and on locked devices with only two big
cores, there aren't enough CPUs to go around - warmup and benchmark
both complete before relevant JIT is complete.
Now, we bump the priority of both the benchmark and JIT thread.
Tracing benchmarks show that the JIT thread goes much faster, which
should significantly reduce the chance we capture results on unjitted
code.
This may also motivate us to use CPU affinity + locked small cores in
the future, we can keep monitoring.
Test: ./gradlew benchmark:b-c:cC
Test: ./gradlew benchmark:b-b:cC
Test: ./gradlew recyclerview:r-b:cC
This CL also adds more logging, and unifies all logging under
"benchmark" tag. This logging was very useful in discovering and
diagnosing the priority problem, since it showed the edge cases where
jit finished *during* the measure pass.
Bug: 140773023
Bug: 142058671
Change-Id: If542e3cb8867165cf7b4688090ee534e68a23562
M benchmark/common/src/androidTest/java/androidx/benchmark/BenchmarkStateTest.kt
M benchmark/common/src/main/java/androidx/benchmark/BenchmarkState.kt
A benchmark/common/src/main/java/androidx/benchmark/ThreadPriority.kt
M benchmark/common/src/main/java/androidx/benchmark/WarmupManager.kt
M benchmark/junit4/src/main/java/androidx/benchmark/junit4/BenchmarkRule.kt
https://android-review.googlesource.com/1138018
https://goto.google.com/android-sha1/345f86d34f7e373f464c0d8185e392b067a2de4a
cc...@google.com <cc...@google.com> #9
After 15+ runs, recyclerview-benchmark and benchmark-benchmark are *significantly* more stable; see the attached screenshots.
There may still be more opportunity here on big/little devices: enabling the small cores and using affinity to keep JIT there, with the main benchmark thread on a big core.
Description
CLs in build: