Infeasible
Status Update
Comments
le...@practiv.com #2
Correction:
So, the end result is that usb_write() sometimes sends a ZLP when it doesn't need to
should be
So, the end result is that usb_write() sometimes sends a ZLP when it doesn't need to, and sometimes neglects to send a ZLP when it should
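For context, a bulk transfer needs a trailing ZLP only when its length is a nonzero multiple of the endpoint's max packet size (512 bytes for USB 2.0 high-speed bulk). A rough way to poke at that boundary from the host, purely as a sketch, is to push a payload sized to an exact multiple of that; the file name and sizes are illustrative, and whether a given push actually lands on the boundary depends on how adb chunks the data:

# Build a payload whose size is an exact multiple of 512 bytes and push it
# over USB to exercise the packet-boundary case.
dd if=/dev/urandom of=payload.bin bs=512 count=2048
adb push payload.bin /data/local/tmp/payload.bin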
le...@practiv.com #3
Note that the CHECK_LE has since been replaced with HandleError(), which causes adbd to not abort...but that stowaway amessage header is being dropped all the same...and so the behavior is pretty much undefined, I imagine. I'm not sure the switch to HandleError was a good idea, since it more or less swept the real problem under the rug.
ba...@google.com
ma...@google.com #4
good bug report, thanks! i'm assuming this isn't darwin-specific either --- it looks like we have similar logic using masks in the linux and windows backends and in the libusb backend too.
I'm not sure the switch to HandleError was a good idea, since it more or less swept the real problem under the rug.
yeah, i know what you mean, but it's also hard to argue with this logic in the commit message that made that change:
These CHECKs are expected to happen if the client does the wrong thing,
so we probably shouldn't be aborting in adbd.
a CHECK in the client (as you suggested earlier) would probably have been the best idea... postel's law and all that :-)
Description
Context
I have two simple PHP apps running in GKE as Kubernetes Deployments. They are APIs supporting a website, named api-deployment and plugin-id-deployment. The api-deployment is associated with a service named api-service, which is used by the plugin-id application, as it makes calls to the api-deployment workload. The api-deployment has an HPA associated with it.

The Problem:
We noticed that, when traffic at the website starts dropping after peak load and the HPA starts scaling down the api-deployment, a few requests from the plugin-id to the api-deployment app fail. Looking at the logs, we can see that GKE sends a command to terminate some of the pods before it updates the api-service (via endpoint slices) to mark these pods as "terminating" and "not ready". As a consequence, we can see the pods announcing that a SIGTERM has been received while the Kubernetes service is still sending new traffic to them, which results in some request failures. You can see logs attached showing the issue.
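One way to watch the ordering described above during a scale-down, as a sketch only: the kubernetes.io/service-name label on EndpointSlices is standard, but the pod label selector is an assumption, so use whatever labels the deployment really sets.

# Terminal 1: watch the EndpointSlices backing api-service; the ready /
# serving / terminating endpoint conditions show when an endpoint is
# marked not-ready.
kubectl get endpointslices -l kubernetes.io/service-name=api-service -o yaml -w

# Terminal 2: watch the api pods as they get terminated.
kubectl get pods -l app=api-deployment -w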
Note 1: Ignore the fact that SIGWINCH is being reported instead of SIGTERM; that's because Apache2 misuses these signals.

Note 2: The api-deployment has a graceful termination period of 30s, but its container terminates very quickly; it only needs a second to respond to pending requests and shut down. Incoming requests start being rejected as soon as the termination signal is received.

What you expected to happen:
I expected the service to mark some of the pods as "terminating" and "not ready" BEFORE they were deleted, to ensure availability. In other words, I expected the service endpoint and endpoint slices requests to be logged before the deletion of the pods and the termination signal logs.
Steps to reproduce:
You don't really need an HPA to reproduce; you can manually force a scale-down, which produces the same result (see the sketch after the wrk command below).
Run the load test:
wrk -t15 -c20 -d30s --latency --timeout 3 https://myapp.com/app-a
This command tests for 30 seconds. You should see that wrk reports failing requests.
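To force the scale-down by hand while the wrk run above is in progress, something like the following should trigger the same race (the deployment name comes from the report; the replica counts are only illustrative):

# Scale up first so there is something to remove...
kubectl scale deployment api-deployment --replicas=4
# ...then, while wrk is still running, scale back down.
kubectl scale deployment api-deployment --replicas=1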
Workaround attempted
I've tried adding a preStop hook on deployment B (api-deployment) to make it sleep for 5 seconds. It seems to reduce the number of errors reported/logged, but doesn't completely resolve the issue.
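For reference, a preStop sleep along those lines can be added with a patch roughly like this. A sketch only: the container name "api" is an assumption, and strategic merge patches match list entries by name, so substitute the real container name.

# Add a 5-second preStop sleep to the api container of api-deployment.
kubectl patch deployment api-deployment --patch '
spec:
  template:
    spec:
      containers:
        - name: api
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "5"]
'

The reason a preStop sleep helps is that the kubelet runs the hook before sending SIGTERM, while endpoint removal proceeds in parallel; the sleep buys the endpoint controllers time to propagate the terminating state, which is consistent with the errors being reduced but not eliminated.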