Fixed
Status Update
Comments
ra...@google.com <ra...@google.com> #2
Yigit, do you have time to fix it?
Re-emission of the same LiveData is racy.
ra...@google.com <ra...@google.com> #3
yea i'll take it.
ap...@google.com <ap...@google.com> #4
Thanks for the detailed analysis. This may not be an issue anymore since we've started using Main.immediate there, but I'm not sure; I'll try to create a test case.
pe...@gmail.com <pe...@gmail.com> #5
just emitting same live data reproduces the issue.
@Test
fun raceTest() {
    val subLiveData = MutableLiveData(1)
    val subject = liveData(testScope.coroutineContext) {
        emitSource(subLiveData)
        emitSource(subLiveData) // crashes
    }
    subject.addObserver().apply {
        testScope.advanceUntilIdle()
    }
}
ra...@google.com <ra...@google.com> #6
With 2.2.0-alpha04 (which uses Main.immediate), the issue still seems to be there (I tested it by calling emitSource() twice, as in your test case).
pe...@gmail.com <pe...@gmail.com> #7
Yeah, sorry, immediate does not fix it.
I actually have a WIP fix for it:
https://android-review.googlesource.com/c/platform/frameworks/support/+/1112186
If your case is the one I found (emitting the same LiveData multiple times, as shown in #5), you can work around it by adding a dummy transformation.
val subLiveData = MutableLiveData(1)
val subject = liveData(testScope.coroutineContext) {
    emitSource(subLiveData.map { it })
    emitSource(subLiveData.map { it })
}
pe...@gmail.com <pe...@gmail.com> #8
Project: platform/frameworks/support
Branch: androidx-master-dev
commit af12e75e6b4110f48e44ca121466943909de8f06
Author: Yigit Boyar <yboyar@google.com>
Date: Tue Sep 03 12:58:11 2019
Fix coroutine livedata race condition
This CL fixes a bug in the liveData builder where emitting the same
LiveData source twice would crash, because the registration of the
second emission could happen before the first one was removed as a
source.
We fix it by using a suspending dispose function. It does feel a bit
hacky, but we cannot make DisposableHandle.dispose async and we do not
want to block there. This does not mean there is a problem if the
developer disposes it manually, since our emit functions take care of
disposing (and there is no other way to add a source to the underlying
MediatorLiveData).
Bug: 140249349
Test: BuildLiveDataTest#raceTest_*
Change-Id: I0b464c242a583da4669af195cf2504e2adc4de40
M lifecycle/lifecycle-livedata-ktx/api/2.2.0-alpha05.txt
M lifecycle/lifecycle-livedata-ktx/api/current.txt
M lifecycle/lifecycle-livedata-ktx/api/public_plus_experimental_2.2.0-alpha05.txt
M lifecycle/lifecycle-livedata-ktx/api/public_plus_experimental_current.txt
M lifecycle/lifecycle-livedata-ktx/api/restricted_2.2.0-alpha05.txt
M lifecycle/lifecycle-livedata-ktx/api/restricted_current.txt
M lifecycle/lifecycle-livedata-ktx/src/main/java/androidx/lifecycle/CoroutineLiveData.kt
M lifecycle/lifecycle-livedata-ktx/src/test/java/androidx/lifecycle/BuildLiveDataTest.kt
https://android-review.googlesource.com/1112186
https://goto.google.com/android-sha1/af12e75e6b4110f48e44ca121466943909de8f06
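The idea behind the "suspending dispose" fix in the commit above can be illustrated with a simplified, hypothetical model. FakeMediator and EmitScope below are invented stand-ins for illustration only, not the real androidx.lifecycle classes: the crash happened because the second emission's addSource could run before the first source's removal had completed, and the fix guarantees removal finishes first.

```kotlin
// Hypothetical model of the race and the fix; FakeMediator/EmitScope are
// illustrative stand-ins, not the real androidx.lifecycle classes.
class FakeMediator {
    private val sources = mutableSetOf<String>()
    fun addSource(s: String) {
        // MediatorLiveData throws if the same source is added twice --
        // this corresponds to the crash in comment #5.
        check(s !in sources) { "This source was already added" }
        sources += s
    }
    fun removeSource(s: String) { sources -= s }
}

class EmitScope(private val mediator: FakeMediator) {
    private var current: String? = null
    // In the real fix this dispose is a suspending call that completes on
    // the main dispatcher before emitSource() continues; it is modeled
    // synchronously here.
    private fun disposeCurrent() {
        current?.let { mediator.removeSource(it) }
        current = null
    }
    fun emitSource(source: String) {
        disposeCurrent()            // removal is guaranteed to finish first...
        mediator.addSource(source)  // ...so re-adding the same source is safe
        current = source
    }
}
```

With this ordering, emitting the same source twice no longer trips the duplicate-source check.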
ra...@google.com <ra...@google.com> #9
No, you don't have to submit new Workers. The Worker will go from an ENQUEUED state to a RUNNING state eventually (after back-off). I just tested it using the same test app you attached before.
ra...@google.com <ra...@google.com> #10
If you turn on logging, you should see something like this:
02-21 14:47:15.264 8191 8207 D WM-ForceStopRunnable: Unfinished Workers exist, rescheduling.
02-21 14:47:15.414 8191 8191 D WM-SystemJobService: onStartJob for d7ce98ce-3d3d-401e-8812-a49430b26fc1
02-21 14:47:15.417 8191 8207 D WM-Processor: Processor: processing d7ce98ce-3d3d-401e-8812-a49430b26fc1
02-21 14:47:15.418 8191 8207 D WM-WorkerWrapper: Delaying execution for com.perracolabs.workmanagerbug.WorkerTest because it is being executed before schedule. [ Delay ]
02-21 14:47:15.583 8191 8191 D WM-Processor: Processor d7ce98ce-3d3d-401e-8812-a49430b26fc1 executed; reschedule = true
02-21 14:47:15.584 8191 8191 D WM-SystemJobService: d7ce98ce-3d3d-401e-8812-a49430b26fc1 executed on JobScheduler
02-21 14:47:25.647 8191 8191 D WM-SystemJobService: onStartJob for d7ce98ce-3d3d-401e-8812-a49430b26fc1
02-21 14:47:25.648 8191 8207 D WM-Processor: Processor: processing d7ce98ce-3d3d-401e-8812-a49430b26fc1
02-21 14:47:25.665 8191 8191 D WM-WorkerWrapper: Starting work for com.perracolabs.workmanagerbug.WorkerTest
02-21 14:47:25.671 8191 8207 D WM-WorkerWrapper: com.perracolabs.workmanagerbug.WorkerTest returned a Success {mOutputData=androidx.work.Data@0} result.
02-21 14:47:25.672 8191 8207 I WM-WorkerWrapper: Worker result SUCCESS for Work [ id=d7ce98ce-3d3d-401e-8812-a49430b26fc1, tags={ com.perracolabs.workmanagerbug.WorkerTest } ]
02-21 14:47:25.673 8191 8207 D WM-PackageManagerHelper: androidx.work.impl.background.systemalarm.RescheduleReceiver disabled
02-21 14:47:25.676 8191 8191 D WM-Processor: Processor d7ce98ce-3d3d-401e-8812-a49430b26fc1 executed; reschedule = false
02-21 14:47:25.676 8191 8191 D WM-SystemJobService: d7ce98ce-3d3d-401e-8812-a49430b26fc1 executed on JobScheduler
02-21 14:47:25.677 8191 8207 D WM-GreedyScheduler: Cancelling work ID d7ce98ce-3d3d-401e-8812-a49430b26fc1
02-21 14:47:25.679 8191 8207 D WM-Processor: Processor stopping d7ce98ce-3d3d-401e-8812-a49430b26fc1
02-21 14:47:25.679 8191 8207 D WM-Processor: WorkerWrapper could not be found for d7ce98ce-3d3d-401e-8812-a49430b26fc1
02-21 14:47:25.679 8191 8207 D WM-StopWorkRunnable: StopWorkRunnable for d7ce98ce-3d3d-401e-8812-a49430b26fc1; Processor.stopWork = false
pe...@gmail.com <pe...@gmail.com> #11
I'm still able to reproduce it, but only intermittently. Most of the time it seems to honor the back-off criteria; other times the workers get permanently stuck, with several in a BLOCKED state and only one in an ENQUEUED state.
When stuck, if I kill the app via the recent-apps history, then after reopening it everything gets processed correctly.
I thought that maybe, since my test sends several Workers, the back-off delays were accumulating and the total delay could reach several minutes. But uninstalling and reinstalling the app to start a clean test shows that even after several minutes they don't get processed.
I've also noticed that if, when the app crashes, I choose "Close app" instead of "Open app again", immediately reopen the app through the launcher, and send new workers before the currently stuck ones get processed, the issue is more likely to reproduce.
I will update when I find a reliable pattern, unless I can prove to myself that I'm doing something wrong.
ra...@google.com <ra...@google.com> #12
You should enable logging in your app, and that will tell you a lot more about what might be going on. Look at how to initialize WorkManager with a custom Configuration.
When I test this, the Workers always eventually transition to RUNNING.
Send us the logs, if you still find something odd.
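The custom-Configuration initialization mentioned above can be sketched roughly as follows. This is a hedged sketch of the WorkManager 1.x-era setup, not verbatim from this thread; it also assumes the default ContentProvider-based initializer has been disabled in AndroidManifest.xml:

```kotlin
// Sketch: initialize WorkManager with a custom Configuration so the WM-*
// debug logs (like the ones in comment #10) are emitted. Assumes the default
// WorkManagerInitializer provider is removed from the manifest, e.g. with
// tools:node="remove".
import android.app.Application
import android.util.Log
import androidx.work.Configuration
import androidx.work.WorkManager

class App : Application() {
    override fun onCreate() {
        super.onCreate()
        val config = Configuration.Builder()
            .setMinimumLoggingLevel(Log.DEBUG) // verbose WM-* logging
            .build()
        WorkManager.initialize(this, config)
    }
}
```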
pe...@gmail.com <pe...@gmail.com> #13
Please find attached a log file.
As you will see in the log, the first time the chain gets processed correctly; then, after the second crash, the workers got stuck. I waited for several minutes, yet they didn't get processed.
The last line of the log (shown next) may be related to the problem.
2019-02-22 01:36:39.069 766-766/? W/JobServiceContext: Time-out while trying to bind 82dcb4 #u0a29/979 com.perracolabs.workmanagerbug/androidx.work.impl.background.systemjob.SystemJobService, dropping.
su...@google.com <su...@google.com> #14
Where are you initializing your WorkManager?
pe...@gmail.com <pe...@gmail.com> #15
For this test, WorkManager.getInstance() is first called in the activity's onCreate() method to set up the LiveData observers, so I assume that is when it gets initialized. WorkManager.getInstance() is also called each time the test execution button is pressed, in its onClick handler.
ra...@google.com <ra...@google.com> #16
I tried to reproduce this problem again, and it seems that I can get into this state easily on Android P (relative to other versions of the platform).
After looking into this further, this is what is going on: after the app process crashes more than a few times, ActivityManager designates the app as "bad". JobScheduler tries to bind to WorkManager's JobService via ActivityManager to start a job, and times out because ActivityManager considers the app "bad" and will not allow binds to the JobService.
At this point JobScheduler drops the job (a bug), and hence the Worker is still ENQUEUED.
At some point in the future (after a subsequent process restart / more Workers being enqueued), this Worker will eventually execute, because WorkManager creates a new job when it discovers a Worker does not have a backing job.
I am going to file a bug against JobScheduler to better handle this edge case. However, this is not something WorkManager can work around. Also, your work is not blocked indefinitely; it will eventually transition to a RUNNING state.
pe...@gmail.com <pe...@gmail.com> #17
This makes sense. I guess this won't be a problem for a production app, since an app that crashes constantly isn't realistic, and it's understandable that forcing the crash in a test project eventually reproduces it under unrealistic conditions.
That said, without forcing the crash scenario, the initial issue I reported seems to be resolved in a normal use case. Still, perhaps a brief note could be added to the documentation stating that this can happen during development: I discovered the initial issue while debugging inside a Worker.doWork() method, killed the app process to restart debugging, and found that the chain wasn't getting processed.
Thanks for the update.
jo...@truecaller.com <jo...@truecaller.com> #18
Hi, I would like to add some more details about a crash we started to experience after updating to WorkManager 1.0.0. I think this crash is related to the problem described in this thread, especially toward the end of it. We verified that it had a direct impact on our users, and we had to ship another version downgrading WorkManager to 1.0.0-rc01.
Our app has many users (120M+ daily active) running on different OS versions. After we released with WorkManager 1.0.0, a relatively small number of users generated a very high number of crashes.
Some stats from one of the crash groups caused by this bug: 7,251 users but 2,770,397 reports total.
Fatal Exception: java.lang.IllegalStateException: Apps may not schedule more than 100 distinct jobs
at android.os.Parcel.createException(Parcel.java:1958)
at android.os.Parcel.readException(Parcel.java:1918)
at android.os.Parcel.readException(Parcel.java:1868)
at android.app.job.IJobScheduler$Stub$Proxy.schedule(IJobScheduler.java:184)
at android.app.JobSchedulerImpl.schedule(JobSchedulerImpl.java:44)
at androidx.work.impl.background.systemjob.SystemJobScheduler.scheduleInternal(SystemJobScheduler.java:161)
at androidx.work.impl.background.systemjob.SystemJobScheduler.schedule(SystemJobScheduler.java:128)
at androidx.work.impl.Schedulers.schedule(Schedulers.java:91)
at androidx.work.impl.utils.ForceStopRunnable.run(ForceStopRunnable.java:100)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
at java.lang.Thread.run(Thread.java:764)
Caused by android.os.RemoteException: Remote stack trace:
at com.android.server.job.JobSchedulerService.scheduleAsPackage(JobSchedulerService.java:871)
at com.android.server.job.JobSchedulerService$JobSchedulerStub.schedule(JobSchedulerService.java:2554)
at android.app.job.IJobScheduler$Stub.onTransact(IJobScheduler.java:60)
at android.os.Binder.execTransact(Binder.java:731)
We had to halt the ongoing release, downgrade WorkManager, and release a new version. We verified the crashes went down. Please prioritize and fix. If needed, I can provide additional information: links to crash stacks, etc.
su...@google.com <su...@google.com> #19
#18: Please file a new bug with the appropriate details. Please also try 1.0.1-rc01, which includes the fix for this issue, before filing.
ra...@google.com <ra...@google.com> #20
WorkManager's schedulers won't schedule the same job if JobScheduler is already aware of the job. This is because we know the jobId corresponding to a given WorkRequest id.
So the fact that more jobs are being scheduled is very interesting (because they should not be).
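The dedupe described above (a stable WorkRequest-id to jobId mapping, so already-known work is not rescheduled) can be sketched with a hypothetical model. FakeJobScheduler and IdempotentScheduler are invented names for illustration, not the real SystemJobScheduler code:

```kotlin
// Hypothetical sketch of idempotent scheduling: because each WorkRequest id
// maps to a stable jobId, re-scheduling known work is a no-op, so duplicates
// alone should not hit JobScheduler's 100-distinct-jobs limit.
class FakeJobScheduler {
    val scheduled = mutableSetOf<Int>()
    fun schedule(jobId: Int) { scheduled += jobId }
}

class IdempotentScheduler(private val js: FakeJobScheduler) {
    private val jobIds = mutableMapOf<String, Int>()
    private var nextJobId = 0
    fun schedule(workRequestId: String) {
        // Reuse the jobId if we have already assigned one to this work.
        val jobId = jobIds.getOrPut(workRequestId) { nextJobId++ }
        // Skip the call entirely if JobScheduler already knows about it.
        if (jobId !in js.scheduled) js.schedule(jobId)
    }
}
```

Under this model, scheduling the same work id any number of times results in exactly one job, which is why the "more than 100 distinct jobs" crash in #18 is surprising.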
Can you please provide us with some more information?
* How are you initializing WorkManager?
* Are you creating any WorkRequests on app startup? If so, what kind of work requests are they?
* Are you scheduling work using JobScheduler on your own, apart from WorkManager?
* How many users (total) and what share of your population were affected? (Just to get an idea of the % of users affected.) What devices were they on? What versions of Android?
* Can you send us a sample app that reproduces the problem?
sa...@gmail.com <sa...@gmail.com> #21
I need some work.
Description
Version used: 1.0.0-rc01
Devices/Android versions reproduced on: Nexus 6, API 25 (7.1.1)
If an app crashes while a Worker is executing, then newly submitted Workers aren't executed, even after the app is restarted. WorkManager will not resume any further work execution at any time, even after waiting several minutes.
So far this has been seen only when sending OneTimeWorkRequest work via enqueueUniqueWork or beginUniqueWork.
The issue started to appear in version 1.0.0-rc01.
Find attached a test project; once run, the app shows the steps to reproduce the issue.