Feature Request P2
Comments
ro...@marvinblue.earth <ro...@marvinblue.earth> #2
The most important part is that this metadata should be visible from the task list log (ee.data.listOperations) so I can use it for my purposes.
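For illustration, this is the kind of lookup intended: a minimal sketch, assuming a hypothetical userAttributes field echoed back in each operation's metadata (today the metadata only carries fields such as description and state).
```
import ee

ee.Initialize()

# Find every task that was started for a given client.
for op in ee.data.listOperations():
    meta = op.get('metadata', {})
    # 'userAttributes' is the proposed field; it does not exist today.
    attrs = meta.get('userAttributes', {})
    if attrs.get('client_id') == '<MY_SPECIFIC_CLIENT_ID>':
        print(op['name'], meta.get('state'), meta.get('description'))
```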
br...@google.com <br...@google.com> #3
Thanks for the suggestion. If the string size limit were relaxed for the description
field, would that work for your purpose?
ro...@marvinblue.earth <ro...@marvinblue.earth> #4
That sounds like a great start!
We would need around 300 characters for saving our metadata in the description field:
- full file path in GCP: the destination file path in GCP appears only partially, and only on success flows, so we need to add it explicitly to the description
- our own metadata
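Until then, the workaround looks roughly like the sketch below: packing the destination path and our own fields into the description as delimited pairs rather than JSON. The allowed-character set is an assumption (export descriptions appear to reject characters such as '/'), and the ~300-character budget assumes the relaxed limit discussed above.
```
import re

# Assumed allowed characters for task descriptions; '/' and '{' seem to be
# rejected, which rules out raw JSON or raw gs:// paths.
SAFE = re.compile(r'[^a-zA-Z0-9 .,:;_-]')

def pack(path, client):
    """Serialize the GCP destination path plus our metadata into one string."""
    raw = 'path:{};client:{}'.format(path, client)
    return SAFE.sub('_', raw)  # replace rejected characters such as '/'

def unpack(description):
    """Recover the key:value pairs when reading tasks back later."""
    return dict(item.split(':', 1) for item in description.split(';'))

desc = pack('gs://my-bucket/exports/img_0001.tif', 'app-123')  # hypothetical path
print(len(desc), unpack(desc))
```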
el...@gmail.com <el...@gmail.com> #5
It would be helpful to have a parent task/run ID as an operation property, if user-set attributes are not possible. It could perhaps be inferred from the client's IP address combined with short request intervals: e.g., a client submits one task every 2 seconds for 80 seconds, and these are grouped as one "run"; 4 minutes of inactivity pass, then additional tasks are submitted 2 seconds apart for 45 seconds, and these become a distinct "run"; the hash IDs of these runs are returned after a timeout interval. This would allow the user to filter operations based on the instance of their script that is responsible for a set of related pending tasks.
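Even without server-side support, a rough version of this grouping can be done client-side by clustering operations on gaps in their creation times. A minimal sketch, assuming each operation's metadata carries an ISO-8601 createTime and using the 4-minute gap suggested above:
```
from datetime import datetime, timedelta
import ee

ee.Initialize()

GAP = timedelta(minutes=4)  # inactivity that separates two "runs"

def created(op):
    # Assumes createTime like '2024-01-15T12:34:56.789Z'.
    return datetime.fromisoformat(op['metadata']['createTime'].replace('Z', '+00:00'))

def group_into_runs(operations):
    """Cluster operations into runs wherever the submission gap exceeds GAP."""
    runs, current = [], []
    for op in sorted(operations, key=created):
        if current and created(op) - created(current[-1]) > GAP:
            runs.append(current)
            current = []
        current.append(op)
    if current:
        runs.append(current)
    return runs

for i, run in enumerate(group_into_runs(ee.data.listOperations())):
    print('run {}: {} tasks'.format(i, len(run)))
```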
Example Use Case: I divide up all of my spatial reduction operations into batches using the number of individual features from a featureCollection as the numerator. I can assign an arbitrary denominator (batch size) that subsets the featureCollection into smaller tasks to balance computation/memory load and stay within limits while respecting the 3,000-task cap. Then I will monitor running/pending tasks by periodically querying operations, in order to trigger additional downstream work once all computation is complete. If one task fails, my entire run needs to be abandoned and reconfigured around whatever caused the error. Currently I must manually check for failed tasks as I do not want to request all tasks in history, but setting a time boundary is impractical as some runs can take 10 minutes, and others can take many hours, up to a day or more.
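For reference, the batching described here looks something like the following sketch, where the asset path and the reduce_batch() helper are hypothetical placeholders, and the run tag in the description anticipates the monitoring loop further down:
```
import ee

ee.Initialize()

fc = ee.FeatureCollection('users/someone/my_features')  # hypothetical asset
total = fc.size().getInfo()  # numerator: number of individual features
batch_size = 500             # arbitrary denominator, tuned to stay within limits

def reduce_batch(batch):
    # Stand-in for the real spatial reduction over this batch (hypothetical).
    return batch.map(lambda f: f.set('area', f.geometry().area()))

for offset in range(0, total, batch_size):
    batch = ee.FeatureCollection(fc.toList(batch_size, offset))
    task = ee.batch.Export.table.toDrive(
        collection=reduce_batch(batch),
        description='run42_batch_{}'.format(offset),  # run tag, reused below
    )
    task.start()
```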
Benefit: If this feature were implemented, I could use the parent task/run ID to cancel all related operations. Less tedious than tracking individual tasks by 'name'.
Additional benefit: In this case I am only concerned with the status of the tasks in aggregate. Had I the ability, I would periodically query the status of the parent task, and if ANY child task has failed, terminate all the others; if ALL child tasks have succeeded, move on to the next step.
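That aggregate check is expressible today, provided the run's tasks can be selected somehow (here, by the description prefix from the batching sketch above). A minimal sketch, assuming ee.data.cancelOperation() and the state names used by the Cloud API:
```
import time
import ee

ee.Initialize()

RUN_TAG = 'run42_'  # prefix identifying this run's tasks (from the sketch above)

def run_ops():
    return [op for op in ee.data.listOperations()
            if op['metadata'].get('description', '').startswith(RUN_TAG)]

while True:
    ops = run_ops()
    states = [op['metadata']['state'] for op in ops]
    if 'FAILED' in states:
        # ANY child task failed: terminate all others in the run.
        for op in ops:
            if op['metadata']['state'] in ('PENDING', 'RUNNING'):
                ee.data.cancelOperation(op['name'])
        raise RuntimeError('run failed; remaining sibling tasks cancelled')
    if states and all(s == 'SUCCEEDED' for s in states):
        break  # ALL child tasks succeeded: move on to the next step
    time.sleep(60)
```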
Additional benefit: This would allow multiple devs (we pay for 2 seats) to monitor only their own tasks, irrespective of the other's workflow.
I hope what I have described here is clear and I'm happy to discuss further. I did not want to open a new ticket because this issue sounded sufficiently similar - my apologies if this is too distinct from OP's issue.
Description
I want to be able to add some metadata (JSON) to each task I start, so I can track, for example, who the "client" was that the task was meant for (billing management per client) and what context the task was running from.
For now we are using the "description" field, which is a size-limited string.
For example:
```
task = ee.batch.Export.image.toCloudStorage(
    ...,
    description="a string with length limitation",
    # Proposed: arbitrary user-defined metadata attached to the task.
    userAttributes={"client_id": "<MY_SPECIFIC_CLIENT_ID>", "project": "app-123"},
)
```
Details about the situations or use cases where this feature would be valuable:
- billing management
- getting the task context (purpose, which flow triggered it, who is the client the data was meant for)