In some cases, a handler will receive a task that it definitively knows cannot ever be processed. For example, the payload might be fundamentally invalid (missing parameters, referring to a resource that the handler knows does not and will not ever exist).
Or the request method, authentication, or other aspects of the HTTP request itself might not meet the handler's requirements or expectations.
In this case it would be great if we could report a failure / error response, but also mark that the task should not be retried.
How this might work:
An HTTP handler could provide an optional response header e.g. X-CloudTasks-Retry. The presence of this header in the response would override Cloud Tasks' default handling for the response status code.
For example, imagine the following HTTP conversation:
> GET /task/that-requires-http-post
> Host: my.task.handler
> // etc
>
< HTTP/1.1 400 Bad Request
< X-CloudTasks-Retry: never
< Content-Type: application/json
<
< {"msg": "Endpoint requires POST"}
When Cloud Tasks receives the response it would:
Note the 400 response code and record it, marking the execution as a failure
Note the X-CloudTasks-Retry: never header and abort scheduling of any further retries
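As a sketch, a handler adopting this proposal might look like the following. Note that the `X-CloudTasks-Retry` header and its `never` value are the proposal in this issue, not an existing Cloud Tasks feature, and the `handle_task` shape and the `resource_id` field are purely illustrative:

```python
# Hypothetical handler sketch: report an accurate failure code while asking
# Cloud Tasks (via the proposed X-CloudTasks-Retry header) not to retry.
# Nothing here is an existing Cloud Tasks API; the header is the proposal.

def handle_task(method: str, payload: dict):
    """Return (status, headers, body) for an incoming task request."""
    if method != "POST":
        # The task can never succeed with this method: fail accurately,
        # but suppress retries.
        return 400, {"X-CloudTasks-Retry": "never"}, {"msg": "Endpoint requires POST"}
    if "resource_id" not in payload:  # "resource_id" is an illustrative required field
        return 422, {"X-CloudTasks-Retry": "never"}, {"msg": "Missing resource_id"}
    # ...normal processing would go here...
    return 200, {}, {"msg": "ok"}
```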
This solution would solve the immediate issue and provide a lot of flexibility for application servers, for example:
There might be certain kinds of failures that the handler would like to retry more than once but fewer times than the maximum configured for the queue. For example, if the failure could stem from a configuration mismatch during a deployment, the handler might allow a couple of attempts before concluding that the task contains invalid data. Handlers could inspect the incoming X-CloudTasks-TaskRetryCount header to decide whether to allow further attempts.
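A sketch of that decision, assuming the X-CloudTasks-TaskRetryCount request header that Cloud Tasks already sends, plus the proposed (hypothetical) X-CloudTasks-Retry response header; the attempt limit is an arbitrary illustrative value:

```python
# Allow a bounded number of attempts (fewer than the queue maximum) before
# declaring the task permanently unprocessable. X-CloudTasks-TaskRetryCount
# is a real request header; X-CloudTasks-Retry remains the proposal.

MAX_ATTEMPTS_FOR_CONFIG_MISMATCH = 2  # illustrative: tolerate a deploy window

def extra_response_headers(request_headers: dict) -> dict:
    attempts = int(request_headers.get("X-CloudTasks-TaskRetryCount", "0"))
    if attempts >= MAX_ATTEMPTS_FOR_CONFIG_MISMATCH:
        # The deployment has had time to settle; the payload really is bad.
        return {"X-CloudTasks-Retry": "never"}
    return {}  # fall back to the queue's normal retry policy
```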
The same header could be extended in future for related purposes. For example, a future syntax such as X-CloudTasks-Retry: not-before 2020-07-28T14:30:00Z could work in parallel with https://issuetracker.google.com/issues/141314105 to let the handler specify custom retry options for an individual task. This would be useful where the handler identifies that the failure is caused by an external system that is known to take some time to recover, a rate limit on a downstream API, or an ongoing blocking operation on a particular resource (perhaps a file upload that has not completed some kind of conversion) that does not affect processing of other tasks in the queue.
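The speculative not-before syntax could be produced along these lines; both the header and its value format are hypothetical extensions, not shipped API:

```python
# Build the speculative "not-before" value: the handler estimates when a
# blocking condition (rate limit, recovering downstream system) should clear
# and asks for the next attempt to happen after that time. Both the header
# and this syntax are hypothetical proposals, not an existing feature.

from datetime import datetime, timedelta, timezone

def retry_not_before(delay: timedelta) -> dict:
    resume_at = datetime.now(timezone.utc) + delay
    return {"X-CloudTasks-Retry": resume_at.strftime("not-before %Y-%m-%dT%H:%M:%SZ")}
```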
If applicable, reasons why alternative solutions are not sufficient:
Currently there are only two options for handlers:
Have the handler return a fake 2xx code that it does not normally use for success responses.
Have the handler return an accurate HTTP code, e.g. 400 or 422.
fake 2xx
Sending a fake 2xx ensures the task is removed from the queue without further retries. However, it is not marked as a failure in Cloud Tasks, nor (by default) in logs or metrics systems monitoring the performance of the handler system. This makes these events harder to detect, monitor and alert on. It can also be confusing when debugging, unless developers understand why e.g. a 209 is being returned for an operation that has failed.
The fake 2xx also increases the risks around non-Cloud-Tasks traffic hitting an endpoint unexpectedly. It is harder to see in the logs, and the "success" response may lead some bots or crawlers to believe they have found a valid URL.
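For comparison, the current fake-2xx workaround amounts to a sketch like this, where the 209 code is the unassigned status from the example above and the failure is visible only via out-of-band logging:

```python
# Current workaround sketch: return a fake success so the queue drops the
# task, and record the real failure only in application logs. 209 is an
# unassigned status chosen purely to be distinguishable from real successes.

import logging

def drop_unprocessable_task(reason: str):
    logging.error("Dropping unprocessable task: %s", reason)  # sole failure signal
    return 209, {}, {"msg": reason}
```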
Accurate HTTP code
Sending an accurate HTTP code significantly improves the ease of normal monitoring, alerting and logging operations. It also ensures that any non-cloud-tasks clients are clear on the status of their request (e.g. if the handler sends a 400 or 404 to clients it is not expecting).
However, it comes at the potential cost of a significant number of retries of tasks that can never succeed: in particular, a misconfiguration or bad deployment could flood the queue with tasks containing invalid payloads. These will be retried repeatedly even though we definitively know that all we can do is log them and drop them. As well as generating high traffic volume, the repeated attempts create significant noise for ops staff trying to resolve the issue (for example, by inspecting logs in order to replay broken tasks with valid payloads).