Assigned

Feature Request

Status Update

No update yet.

Description

st...@de.ibm.com

created issue #1

Feb 3, 2025 06:17PM

GKE Upgrade Introduction

The GKE upgrade docs state:

During automatic or manual node upgrades, PodDisruptionBudgets (PDBs) and Pod termination grace period are respected for a maximum of 1 hour. If Pods running on the node can't be scheduled onto new nodes after one hour, GKE initiates the upgrade anyway.

Problem

Usually, GKE waits up to one hour for pods protected by a PodDisruptionBudget to be evicted. Very often, however, GKE does not wait the full hour and instead deletes protected pods immediately.

The GKE logs indicate that a protected pod is deleted immediately when GoogleContainerEngine fails to create an Eviction object for the pod with the error context deadline exceeded.

Details

Here is the failing Eviction object creation as it appears in the GKE logs. Note the "callerSuppliedUserAgent": "GoogleContainerEngine" and the error Timeout: request did not complete within requested timeout - context deadline exceeded. The API server answered with code 504 after a total latency of about 34 seconds.

I replaced the private information in this log entry with placeholders of the form <my-...>.

{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "authenticationInfo": {
      "principalEmail": "service-892946895159@container-engine-robot.iam.gserviceaccount.com"
    },
    "authorizationInfo": [
      {
        "granted": true,
        "permission": "io.k8s.core.v1.pods.eviction.create",
        "resource": "core/v1/namespaces/<my-namespace>/pods/<my-pod-name>/eviction"
      }
    ],
    "methodName": "io.k8s.core.v1.pods.eviction.create",
    "request": {
      "@type": "policy.k8s.io/v1beta1.Eviction",
      "apiVersion": "policy/v1beta1",
      "deleteOptions": {
        "gracePeriodSeconds": 70
      },
      "kind": "Eviction",
      "metadata": {
        "creationTimestamp": null,
        "name": "<my-pod-name>",
        "namespace": "<my-namespace>"
      }
    },
    "requestMetadata": {
      "callerIp": "127.0.0.1",
      "callerSuppliedUserAgent": "GoogleContainerEngine"
    },
    "resourceName": "core/v1/namespaces/<my-namespace>/pods/<my-pod-name>/eviction",
    "response": {
      "@type": "core.k8s.io/v1.Status",
      "apiVersion": "v1",
      "code": 504,
      "details": {},
      "kind": "Status",
      "message": "Timeout: request did not complete within requested timeout - context deadline exceeded",
      "metadata": {},
      "reason": "Timeout",
      "status": "Failure"
    },
    "serviceName": "k8s.io",
    "status": {
      "code": 4,
      "message": "Timeout: request did not complete within requested timeout - context deadline exceeded"
    }
  },
  "insertId": "6bfe4018-305a-4dbc-bdbe-958023ef820c",
  "resource": {
    "type": "k8s_cluster",
    "labels": {
      "project_id": "<my-project-id>",
      "location": "us-central1-c",
      "cluster_name": "<my-cluster-name>"
    }
  },
  "timestamp": "2024-11-06T10:52:48.279820Z",
  "labels": {
    "authorization.k8s.io/reason": "access granted by IAM permissions.",
    "apiserver.latency.k8s.io/response-write": "1.48µs",
    "apiserver.latency.k8s.io/total": "34.007549298s",
    "authorization.k8s.io/decision": "allow",
    "apiserver.latency.k8s.io/serialize-response-object": "472.475µs",
    "apiserver.latency.k8s.io/apf-queue-wait": "65.463µs"
  },
  "logName": "projects/<my-project>/logs/cloudaudit.googleapis.com%2Factivity",
  "operation": {
    "id": "6bfe4018-305a-4dbc-bdbe-958023ef820c",
    "producer": "k8s.io",
    "first": true,
    "last": true
  },
  "receiveTimestamp": "2024-11-06T10:52:53.902285590Z"
}
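As a rough sketch, entries like the one above could be picked out of exported audit logs with a small Python helper. The function name and constants are hypothetical and only for illustration; the field names and values they check are taken from the log entry above.

```python
from typing import Any, Mapping

# Values copied from the audit log entry in this report.
EVICTION_METHOD = "io.k8s.core.v1.pods.eviction.create"
GKE_USER_AGENT = "GoogleContainerEngine"


def is_timed_out_gke_eviction(entry: Mapping[str, Any]) -> bool:
    """Return True if an audit log entry is a GKE eviction call that timed out.

    Hypothetical helper: it only inspects fields that appear in the
    log entry shown in this report.
    """
    payload = entry.get("protoPayload", {})
    user_agent = payload.get("requestMetadata", {}).get("callerSuppliedUserAgent")
    response = payload.get("response", {})
    return (
        payload.get("methodName") == EVICTION_METHOD
        and user_agent == GKE_USER_AGENT
        and response.get("code") == 504
        and response.get("reason") == "Timeout"
    )


# Minimal entry carrying only the fields the helper looks at.
sample_entry = {
    "protoPayload": {
        "methodName": "io.k8s.core.v1.pods.eviction.create",
        "requestMetadata": {"callerSuppliedUserAgent": "GoogleContainerEngine"},
        "response": {"code": 504, "reason": "Timeout"},
    }
}
matched = is_timed_out_gke_eviction(sample_entry)
```

Running this over all audit log entries from an upgrade window would show how many evictions failed this way, which can then be compared against the pods that were deleted without waiting.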

Expectation

During an upgrade, GKE should never delete a pod when it fails to create an Eviction object for that pod. GKE must respect the PodDisruptionBudget for up to one hour, as described in the GKE upgrade docs.
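To make the expected behavior concrete, here is a minimal Python sketch of a drain loop that treats a failed eviction (including a 504 timeout) as a reason to retry, and only falls back to deletion once the one-hour window from the docs has passed. This is an illustration of the expectation, not GKE's actual implementation: drain_pod, try_evict, force_delete, and the retry interval are all hypothetical.

```python
import time
from typing import Callable

# One hour, as stated in the GKE upgrade docs quoted above.
PDB_GRACE_SECONDS = 60 * 60


def drain_pod(
    try_evict: Callable[[], int],
    force_delete: Callable[[], None],
    now: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    deadline_seconds: float = PDB_GRACE_SECONDS,
    retry_interval: float = 5.0,
) -> str:
    """Retry eviction until it succeeds or the deadline passes.

    try_evict is a hypothetical callback that returns the HTTP status
    code of one Eviction request. Any non-2xx status (429 when the PDB
    blocks the eviction, 504 on a timeout) leads to a retry and never
    to an immediate delete.
    """
    deadline = now() + deadline_seconds
    while now() < deadline:
        status = try_evict()
        if 200 <= status < 300:
            return "evicted"
        sleep(retry_interval)
    # Only reached after the full grace window has elapsed.
    force_delete()
    return "force-deleted-after-deadline"
```

The clock and sleep functions are passed in as parameters so the loop can be exercised with a fake clock instead of actually waiting an hour.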

Comments

st...@de.ibm.com <st...@de.ibm.com> #2
Restricted+
Feb 3, 2025 06:27PM


ba...@google.com <ba...@google.com> Feb 4, 2025 04:25AM

Assigned to ka...@google.com.

ka...@google.com <ka...@google.com> #3 Feb 4, 2025 12:02PM

Reassigned to gc...@google.com.

Wow, I really appreciate your fast and correct response. That worked. Thank you for your help.


PodDisruptionBudget is ignored and pods are deleted unexpectedly during GKE upgrade


Issue 394061202
