Flex Templates should allow configuring SDK Image in the template [393165250]

Assigned

Feature Request

Status Update

No update yet.

Description

ma...@delfina.com

created issue #1

Jan 29, 2025 10:08PM

Problem you have encountered:

When building a flex template, you can configure the base image (via gcloud dataflow flex-template build --image), but not the SDK image. The SDK image needs to be sent as a runtime flag.

For jobs that use the same image for both base and SDK, we don't want any version skew, i.e. we want to ensure the base and SDK are identical. But, the runtime environment may not necessarily know exactly what image is in the template (that's the whole point of the template after all), and so is stuck either trying to read and parse the template, or using :latest which risks version skew.

E.g. our flow is:

1. Create a Cloud Build trigger to build both the container and the flex template on changes to our release branch.
2. In Terraform, create a Cloud Scheduler job to send an HTTP request to launch a Dataflow job with that template periodically.

That scheduler job has no clue what version is pushed to the template, but that's where the sdk image flag needs to be set. Moving it into the template would let us set both at the same time.
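
As a rough illustration of the "read and parse the template" workaround, the sketch below reads the image reference out of a template spec file and copies it into a launch request. It assumes the spec file is JSON with a top-level `image` field and that the launch request accepts `sdkContainerImage` under `launchParameter.environment`; both should be checked against the current Dataflow API reference. The function name, bucket path, and job name are hypothetical.

```python
import json


def build_launch_body(template_spec_json: str, template_gcs_path: str, job_name: str) -> dict:
    """Build a launch request body whose SDK image matches the template's image."""
    spec = json.loads(template_spec_json)
    # Assumption: the template spec stores the launcher image under "image".
    image = spec["image"]
    return {
        "launchParameter": {
            "jobName": job_name,
            "containerSpecGcsPath": template_gcs_path,
            # Assumption: the runtime environment accepts "sdkContainerImage".
            "environment": {"sdkContainerImage": image},
        }
    }


# Hypothetical spec contents, as they might be read from the template file.
example_spec = json.dumps({
    "image": "gcr.io/my_pipeline:abcd",
    "sdkInfo": {"language": "PYTHON"},
})
body = build_launch_body(
    example_spec, "gs://my-bucket/templates/my_pipeline.json", "hourly-run"
)
print(json.dumps(body, indent=2))
```

A scheduler job that sends a fixed HTTP request body has nowhere to run logic like this, which is why the workaround is awkward in practice.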

What you expected to happen:

Base image and SDK image should be configurable in the same location to easily avoid skew.

Comments

va...@google.com <va...@google.com> Jan 30, 2025 06:10AM

Assigned to je...@google.com.

je...@google.com <je...@google.com> #2 Jan 30, 2025 07:39AM

Reassigned to gc...@google.com.

Hello,

This issue report has been forwarded to the Cloud Dataflow Product team so that they may investigate it, but there is no ETA for a resolution today. Future updates regarding this issue will be provided here.

je...@google.com <je...@google.com> #3 Jan 31, 2025 04:54AM

Hi,

SDK container image is configurable, in the Optional Parameters section: https://screenshot.googleplex.com/8JRNXBpJdVyrqYT

ma...@delfina.com <ma...@delfina.com> #4 Feb 3, 2025 09:24PM

Thanks for taking a look! I don't see anything in the "Optional Flags" section of the public facing documentation:

https://cloud.google.com/sdk/gcloud/reference/dataflow/flex-template/build

I did a ctrl + f on "sdk" to be extra sure I wasn't just missing something, but it's still totally possible that I am!

Would you mind attaching an image of the screenshot rather than the https://screenshot/ link, which isn't accessible to non-Googlers?

je...@google.com <je...@google.com> #5 Feb 5, 2025 06:36AM

Hi,

I sincerely apologize for the inconvenience caused.

I have attached the image now, could you please check?

dataflow.png

96 KB


ma...@delfina.com <ma...@delfina.com> #6 Feb 5, 2025 07:41PM

All good, thank you for the attached picture!

It looks like that is the creation of a dataflow job via the Cloud Console. Is it possible to set the SDK Image using gcloud dataflow flex-template build?

je...@google.com <je...@google.com> #7 Feb 7, 2025 05:50AM

Hi,

It is not possible to set the SDK image using gcloud dataflow flex-template build because the launcher image and the SDK container image are different images; at minimum, they have different entrypoints.

- The SDK container image's entrypoint is a Go binary that calls the Beam SDK harness; it is usually not modified by the user.

- The launcher image's entrypoint is a Go binary that calls the user program to submit a pipeline.

So to make things work one would need two images.

ma...@delfina.com <ma...@delfina.com> #8 Feb 11, 2025 05:19PM

Thanks for the response!

Two separate things regarding that:

1. You absolutely can use the same image for both launcher and worker; I am doing that now! You do this by setting the entrypoint to the worker entrypoint, and it seems like the launcher must launch by overriding the entrypoint.
2. Even if using two different images, the important thing is being able to configure both of them at the same time, so that you know you won't have version skew, e.g. being able to say gcloud dataflow flex-template build --image=<launcher_image> --sdk_image=<worker_image>
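
The --sdk_image flag above is the proposal and does not exist today. For comparison, here is a minimal sketch of what pinning looks like with the current split, where the SDK image is supplied at run time. It assumes the SDK image is passed as the `sdk_container_image` pipeline parameter, as I understand the custom-container tutorial to do; the helper name, template path, job name, and region are hypothetical.

```python
import shlex


def build_and_run_commands(image: str, template_path: str, job_name: str, region: str):
    """Return (build, run) gcloud argument lists that share one image reference."""
    build = [
        "gcloud", "dataflow", "flex-template", "build", template_path,
        f"--image={image}",
        "--sdk-language=PYTHON",
    ]
    run = [
        "gcloud", "dataflow", "flex-template", "run", job_name,
        f"--template-file-gcs-location={template_path}",
        f"--region={region}",
        # Assumption: the worker/SDK image is set via this pipeline parameter.
        f"--parameters=sdk_container_image={image}",
    ]
    return build, run


build_cmd, run_cmd = build_and_run_commands(
    image="gcr.io/my_pipeline:abcd",
    template_path="gs://my-bucket/templates/my_pipeline.json",
    job_name="hourly-run",
    region="us-central1",
)
print(shlex.join(build_cmd))
print(shlex.join(run_cmd))
```

This only avoids skew when one script controls both commands. When build and run live in different systems, the run side still has to learn the tag somehow.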

ma...@delfina.com <ma...@delfina.com> #9 Feb 11, 2025 09:03PM

For more context, here is an official tutorial on how to use the same custom image for launcher and worker.

To make the tutorial easy to follow, it over-simplifies some things that are really important at enterprise scale. One of them is the management of image versions. In any kind of production environment, the "Build the Flex Template" step and the "Run the Flex Template" step are going to happen at different times in different places, which introduces this problem of version skew. In many enterprise prod environments, ours for sure, the "build" step happens as a result of code changes in our repository (via Cloud Build trigger) and the "run" step happens via a Google Cloud Scheduler job defined in Terraform. Those two steps have no easy way to communicate with each other, so it's hard to coordinate them so that they run compatible versions.

Take the following example:

1. The pipeline as described above is pushed (gcr.io/my_pipeline:abcd) and runs hourly. Because there's no way to define the worker/SDK image at "build" time, the Cloud Scheduler job instructs Dataflow to use the latest version of the SDK image (gcr.io/my_pipeline:latest).
2. An engineer upgrades a third-party dependency foo from version 1.0 to 2.0. That includes a breaking change where the default of foo.read_data's kwarg delete_after_read changed from False to True, so the engineer made sure to update every call to foo.read_data to set delete_after_read to False in the same commit.
3. A Dataflow pipeline kicks off with the launcher from gcr.io/my_pipeline:abcd.
4. Before the workers are started, a rebuild containing the breaking foo upgrade completes and gets tagged as :latest.
5. The pipeline now sends code intended for foo at version 1.0 to a worker which has foo at version 2.0 installed, and so every call to foo.read_data unintentionally deletes data.
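
The failure mode can be reproduced in a toy model. This is only a sketch: the two read_data functions are hypothetical stand-ins for foo 1.0 and foo 2.0, and a plain dict stands in for the data store.

```python
# Toy model of the skew. Every name here is hypothetical and stands in for
# the "foo" dependency from the example.

def read_data_v1(store, key, delete_after_read=False):
    """foo 1.0: reading leaves the data in place by default."""
    value = store[key]
    if delete_after_read:
        del store[key]
    return value


def read_data_v2(store, key, delete_after_read=True):
    """foo 2.0: the default flipped, so a bare call now deletes."""
    value = store[key]
    if delete_after_read:
        del store[key]
    return value


def pipeline_step_abcd(read_data, store):
    """Pipeline code from image :abcd, written against foo 1.0 defaults."""
    return read_data(store, "record")


# Worker image pinned to :abcd (foo 1.0 installed): the data survives.
pinned_store = {"record": "payload"}
pipeline_step_abcd(read_data_v1, pinned_store)

# Worker image resolved from :latest after the rebuild (foo 2.0 installed):
# the same pipeline code deletes the record without any error.
skewed_store = {"record": "payload"}
pipeline_step_abcd(read_data_v2, skewed_store)

print("pinned:", pinned_store)
print("skewed:", skewed_store)
```

Nothing in the skewed run raises an error, which is what makes this kind of skew dangerous.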

je...@google.com <je...@google.com> #10 Feb 13, 2025 03:38AM

Hi,

Thanks for the information.

Could you please confirm if your issue is resolved?

ma...@delfina.com <ma...@delfina.com> #11 Feb 13, 2025 05:25PM

My issue is not resolved, could you please escalate this?

je...@google.com <je...@google.com> #12 Feb 19, 2025 08:04AM

Hello,

Thank you for contacting the Google Cloud support team.

I have gone through your reported issue; however, it seems like this is an issue observed specifically at your end. It would need more specific debugging and analysis. To ensure a faster resolution and dedicated support for your issue, I kindly request you to file a support ticket by clicking here. Our support team will prioritize your request and provide you with the assistance you need.

Please note that the Issue Tracker is primarily meant for reporting commonly observed issues and requesting new features. For individual support issues, it is best to utilize the support ticketing system. I'm going to close this issue which will no longer be monitored. If you have any additional issues or concerns, please don’t hesitate to create a new issue on the Issue Tracker.

We appreciate your cooperation. Thank you!

ma...@delfina.com <ma...@delfina.com> #13 Feb 19, 2025 05:49PM

Hi! I do not need support, this is a feature request as detailed in my comments above. The current configuration options available in dataflow do not easily allow updating jobs without version skew between the launcher and worker images. Rethinking those configuration options is the feature request.

I have provided a detailed example of how such version skew could happen and the potentially serious consequences it could have, and would like the engineering team to be aware of this.