Comments
ar...@google.com #2
Hello,
This issue report has been forwarded to the Cloud Dataflow Engineering team so that they may investigate it, but there is no ETA for a resolution at this time. Future updates regarding this issue will be provided here.
ca...@epidemicsound.com #3
This issue sounds closely related to failures we have recently started seeing with unchanged TensorFlow Datasets code. We've looked quite extensively for code/data changes on our side, but believe something changed within the Dataflow runtime outside of our control.
We tried running with
--experiments=disable_worker_rolling_upgrade
--experiments=runner_harness_container_image=gcr.io/cloud-dataflow/v1beta3/unified-harness:20240708-002-disabled-lgi
as suggested above, but that run also crashed after processing all elements, just before the final resharding logic.
ERROR[dataflow_runner.py]: 2025-01-08T12:56:25.776Z: JOB_MESSAGE_ERROR: Workflow failed. Causes: S72:train_write/WriteFinalShards+train_write/CollectShardInfo/CollectShardInfo/KeyWithVoid+train_write/CollectShardInfo/CollectShardInfo/CombinePerKey/GroupByKey+train_write/CollectShardInfo/CollectShardInfo/CombinePerKey/Combine/Partial+train_write/CollectShardInfo/CollectShardInfo/CombinePerKey/GroupByKey/Write failed
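For reference, here is roughly how we wire those flags into the TFDS generation job (the dataset name, project, region, and bucket below are placeholders, not our actual values):

import tensorflow_datasets as tfds
from apache_beam.options.pipeline_options import PipelineOptions

# The same two --experiments flags as above, plus our usual Dataflow options.
beam_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',            # placeholder
    '--region=europe-west1',               # placeholder
    '--temp_location=gs://my-bucket/tmp',  # placeholder
    '--experiments=disable_worker_rolling_upgrade',
    ('--experiments=runner_harness_container_image='
     'gcr.io/cloud-dataflow/v1beta3/unified-harness:20240708-002-disabled-lgi'),
])

builder = tfds.builder('my_dataset')  # placeholder dataset name
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(beam_options=beam_options))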
ca...@epidemicsound.com #4
Any updates on this?
Description
Workaround:
The current workaround is to run with a custom Runner V2 image that disables our large iterables changes (which are meant to improve the handling of large work items, but in some TensorFlow Datasets cases can cause this failure).
You can try running with the following flags:
--experiments=disable_worker_rolling_upgrade
--experiments=runner_harness_container_image=gcr.io/cloud-dataflow/v1beta3/unified-harness:20240708-002-disabled-lgi
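For a pipeline launched from Python, a minimal sketch of passing both experiments (the project, region, and staging bucket below are placeholders):

from apache_beam.options.pipeline_options import PipelineOptions

# --experiments is a repeatable flag, so both values are applied.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',            # placeholder
    '--region=us-central1',                # placeholder
    '--temp_location=gs://my-bucket/tmp',  # placeholder
    '--experiments=disable_worker_rolling_upgrade',
    ('--experiments=runner_harness_container_image='
     'gcr.io/cloud-dataflow/v1beta3/unified-harness:20240708-002-disabled-lgi'),
])

The same flags can also typically be appended to the command line used to launch the pipeline script.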