Comments
ar...@google.com #2
Hello,
This issue report has been forwarded to the Cloud Dataflow Engineering team so that they may investigate it, but there is no ETA for a resolution at this time. Future updates regarding this issue will be provided here.
ca...@epidemicsound.com #3
This issue sounds closely related to failures we have recently started seeing with unchanged TensorFlow Datasets code. We've looked quite extensively for code/data changes on our side, but believe something changed within the Dataflow runtime outside of our control.
We tried running with
--experiments=disable_worker_rolling_upgrade
--experiments=runner_harness_container_image=gcr.io/cloud-dataflow/v1beta3/unified-harness:20240708-002-disabled-lgi
as suggested above, but that run also crashed after processing all elements, just before the final resharding logic.
ERROR[dataflow_runner.py]: 2025-01-08T12:56:25.776Z: JOB_MESSAGE_ERROR: Workflow failed. Causes: S72:train_write/WriteFinalShards+train_write/CollectShardInfo/CollectShardInfo/KeyWithVoid+train_write/CollectShardInfo/CollectShardInfo/CombinePerKey/GroupByKey+train_write/CollectShardInfo/CollectShardInfo/CombinePerKey/Combine/Partial+train_write/CollectShardInfo/CollectShardInfo/CombinePerKey/GroupByKey/Write failed
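For reference, here is roughly how we wire those flags into the TFDS generation job (the dataset name, project, region, and bucket below are placeholders, not our actual values):

import tensorflow_datasets as tfds
from apache_beam.options.pipeline_options import PipelineOptions

# The same two --experiments flags as above, plus our usual Dataflow options.
beam_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',            # placeholder
    '--region=europe-west1',               # placeholder
    '--temp_location=gs://my-bucket/tmp',  # placeholder
    '--experiments=disable_worker_rolling_upgrade',
    ('--experiments=runner_harness_container_image='
     'gcr.io/cloud-dataflow/v1beta3/unified-harness:20240708-002-disabled-lgi'),
])

builder = tfds.builder('my_dataset')  # placeholder dataset name
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(beam_options=beam_options))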
ca...@epidemicsound.com #4
Any updates on this?
Description
Workaround:
The current workaround is to run with a custom Runner V2 image that disables our large iterables changes (which are meant to improve the handling of large work items, but in some TensorFlow Datasets cases can cause this failure).
You can try running with the following flags:
--experiments=disable_worker_rolling_upgrade
--experiments=runner_harness_container_image=gcr.io/cloud-dataflow/v1beta3/unified-harness:20240708-002-disabled-lgi
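For a pipeline launched from Python, a minimal sketch of passing both experiments (the project, region, and staging bucket below are placeholders):

from apache_beam.options.pipeline_options import PipelineOptions

# --experiments is a repeatable flag, so both values are applied.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',            # placeholder
    '--region=us-central1',                # placeholder
    '--temp_location=gs://my-bucket/tmp',  # placeholder
    '--experiments=disable_worker_rolling_upgrade',
    ('--experiments=runner_harness_container_image='
     'gcr.io/cloud-dataflow/v1beta3/unified-harness:20240708-002-disabled-lgi'),
])

The same flags can also typically be appended to the command line used to launch the pipeline script.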