Feature Request P2
Comments
ro...@marvinblue.earth <ro...@marvinblue.earth> #2
The most important part is that this metadata should be visible from the task list log (ee.data.listOperations) so I can use it for my purposes.
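For illustration, this is the kind of lookup intended: a minimal sketch, assuming a hypothetical userAttributes field echoed back in each operation's metadata (today the metadata only carries fields such as description and state).
```
import ee

ee.Initialize()

# Find every task that was started for a given client.
for op in ee.data.listOperations():
    meta = op.get('metadata', {})
    # 'userAttributes' is the proposed field; it does not exist today.
    attrs = meta.get('userAttributes', {})
    if attrs.get('client_id') == '<MY_SPECIFIC_CLIENT_ID>':
        print(op['name'], meta.get('state'), meta.get('description'))
```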
br...@google.com <br...@google.com> #3
Thanks for the suggestion. If the string size limit were relaxed for the description
field, would that work for your purpose?
ro...@marvinblue.earth <ro...@marvinblue.earth> #4
That sounds like a great start!
We would need around 300 characters for saving our metadata in the description field:
- full file path in GCP: the destination file path in GCP appears only partially, and only on success flows, so we need to add it explicitly to the description
- our own metadata
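Until then, the workaround looks roughly like the sketch below: packing the destination path and our own fields into the description as delimited pairs rather than JSON. The allowed-character set is an assumption (export descriptions appear to reject characters such as '/'), and the ~300-character budget assumes the relaxed limit discussed above.
```
import re

# Assumed allowed characters for task descriptions; '/' and '{' seem to be
# rejected, which rules out raw JSON or raw gs:// paths.
SAFE = re.compile(r'[^a-zA-Z0-9 .,:;_-]')

def pack(path, client):
    """Serialize the GCP destination path plus our metadata into one string."""
    raw = 'path:{};client:{}'.format(path, client)
    return SAFE.sub('_', raw)  # replace rejected characters such as '/'

def unpack(description):
    """Recover the key:value pairs when reading tasks back later."""
    return dict(item.split(':', 1) for item in description.split(';'))

desc = pack('gs://my-bucket/exports/img_0001.tif', 'app-123')  # hypothetical path
print(len(desc), unpack(desc))
```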
el...@gmail.com <el...@gmail.com> #5
It would be helpful to have a parent task/run ID as an operation property, if user-set attributes are not possible. It could perhaps be inferred from the client's IP address combined with short request intervals: e.g., a client submits one task every 2 seconds for 80 seconds, and these are grouped as one "run"; 4 minutes of inactivity pass, then additional tasks are submitted 2 seconds apart for 45 seconds, and these become a distinct "run"; the hash IDs of these runs are returned after a timeout interval. This would allow the user to filter operations based on the instance of their script that is responsible for a set of related pending tasks.
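Even without server-side support, a rough version of this grouping can be done client-side by clustering operations on gaps in their creation times. A minimal sketch, assuming each operation's metadata carries an ISO-8601 createTime and using the 4-minute gap suggested above:
```
from datetime import datetime, timedelta
import ee

ee.Initialize()

GAP = timedelta(minutes=4)  # inactivity that separates two "runs"

def created(op):
    # Assumes createTime like '2024-01-15T12:34:56.789Z'.
    return datetime.fromisoformat(op['metadata']['createTime'].replace('Z', '+00:00'))

def group_into_runs(operations):
    """Cluster operations into runs wherever the submission gap exceeds GAP."""
    runs, current = [], []
    for op in sorted(operations, key=created):
        if current and created(op) - created(current[-1]) > GAP:
            runs.append(current)
            current = []
        current.append(op)
    if current:
        runs.append(current)
    return runs

for i, run in enumerate(group_into_runs(ee.data.listOperations())):
    print('run {}: {} tasks'.format(i, len(run)))
```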
Example Use Case: I divide up all of my spatial reduction operations into batches using the number of individual features from a featureCollection as the numerator. I can assign an arbitrary denominator (batch size) that subsets the featureCollection into smaller tasks to balance computation/memory load and stay within limits while respecting the 3,000-task cap. Then I will monitor running/pending tasks by periodically querying operations, in order to trigger additional downstream work once all computation is complete. If one task fails, my entire run needs to be abandoned and reconfigured around whatever caused the error. Currently I must manually check for failed tasks as I do not want to request all tasks in history, but setting a time boundary is impractical as some runs can take 10 minutes, and others can take many hours, up to a day or more.
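For reference, the batching described here looks something like the following sketch, where the asset path and the reduce_batch() helper are hypothetical placeholders, and the run tag in the description anticipates the monitoring loop further down:
```
import ee

ee.Initialize()

fc = ee.FeatureCollection('users/someone/my_features')  # hypothetical asset
total = fc.size().getInfo()  # numerator: number of individual features
batch_size = 500             # arbitrary denominator, tuned to stay within limits

def reduce_batch(batch):
    # Stand-in for the real spatial reduction over this batch (hypothetical).
    return batch.map(lambda f: f.set('area', f.geometry().area()))

for offset in range(0, total, batch_size):
    batch = ee.FeatureCollection(fc.toList(batch_size, offset))
    task = ee.batch.Export.table.toDrive(
        collection=reduce_batch(batch),
        description='run42_batch_{}'.format(offset),  # run tag, reused below
    )
    task.start()
```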
Benefit: If this feature were implemented, I could use the parent task/run ID to cancel all related operations. Less tedious than tracking individual tasks by 'name'.
Additional benefit: In this case I am only concerned with the status of the tasks in aggregate. Had I the ability, I would periodically query the status of the parent task, and if ANY child task has failed, terminate all the others; if ALL child tasks have succeeded, move on to the next step.
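That aggregate check is expressible today, provided the run's tasks can be selected somehow (here, by the description prefix from the batching sketch above). A minimal sketch, assuming ee.data.cancelOperation() and the state names used by the Cloud API:
```
import time
import ee

ee.Initialize()

RUN_TAG = 'run42_'  # prefix identifying this run's tasks (from the sketch above)

def run_ops():
    return [op for op in ee.data.listOperations()
            if op['metadata'].get('description', '').startswith(RUN_TAG)]

while True:
    ops = run_ops()
    states = [op['metadata']['state'] for op in ops]
    if 'FAILED' in states:
        # ANY child task failed: terminate all others in the run.
        for op in ops:
            if op['metadata']['state'] in ('PENDING', 'RUNNING'):
                ee.data.cancelOperation(op['name'])
        raise RuntimeError('run failed; remaining sibling tasks cancelled')
    if states and all(s == 'SUCCEEDED' for s in states):
        break  # ALL child tasks succeeded: move on to the next step
    time.sleep(60)
```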
Additional benefit: This would allow multiple devs (we pay for 2 seats) to monitor only their own tasks, irrespective of the other's workflow.
I hope what I have described here is clear and I'm happy to discuss further. I did not want to open a new ticket because this issue sounded sufficiently similar - my apologies if this is too distinct from OP's issue.
Description
I want to be able to add some metadata (JSON) to each task I start, so I can track, for example, who the "client" was that the task was meant for (billing management per client) and what context the task was running from.
For now we are using the "description" field, which is a size-limited string.
For example:
```
task = ee.batch.Export.image.toCloudStorage(
    ...,
    description="a string with length limitation",
    # Proposed: arbitrary user-defined metadata attached to the task.
    userAttributes={"client_id": "<MY_SPECIFIC_CLIENT_ID>", "project": "app-123"},
)
```
Details about the situations or use cases where this feature would be valuable:
- billing management
- getting the task context (purpose, which flow triggered it, who is the client the data was meant for)