Status Update
Comments
po...@google.com <po...@google.com> #2
The product team is working on this issue; there is no ETA yet.
jo...@doit.com <jo...@doit.com> #3
ya...@doit-intl.com <ya...@doit-intl.com> #4
My client is reporting he is still experiencing this issue (GKE 1.18.17-gke.1901)
tf...@google.com <tf...@google.com> #5
Update:
We've confirmed that the fix is only available starting with Kubernetes 1.20.
Hence, to fix this, the cluster's node pools must be on 1.20+ versions to have the feature enabled. Another workaround is to simply not use preemptible VMs for node pools on 1.19 and lower.
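For example, upgrading just the affected node pool could look like this (a sketch; CLUSTER_NAME, POOL_NAME, ZONE, and VERSION are placeholders, and VERSION must be a 1.20+ GKE release available to your cluster):
gcloud container clusters upgrade CLUSTER_NAME --node-pool=POOL_NAME --cluster-version=VERSION --zone=ZONE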
Note that this issue has little to no impact on workloads. As long as the pod is backed by a controller (Deployment, StatefulSet, etc.), when a pod runs into the NodeAffinity issue a new pod is immediately created and rescheduled. However, it creates a lot of false positives: the pod stays in a failed state, which makes its controller's page in the Cloud Console UI appear failed as well, even when all desired replicas are healthy and running.
The graceful node termination feature is noisy in the same way, as the Shutdown failed status of pods bubbles up to the controller's page in the Cloud Console UI as well. Our UI team is working on enhancing the UX for this feature.
In the meantime, you can clean up pods as mentioned in the
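A common way to do that cleanup (a sketch; this deletes every pod in phase Failed across all namespaces, which covers the NodeAffinity and Shutdown failures above, so review the list first if you want to keep any pods for debugging):
kubectl delete pods --field-selector=status.phase=Failed --all-namespaces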
wa...@google.com <wa...@google.com> #6
ji...@google.com <ji...@google.com> #7
I think the graceful node termination feature is now broken. See the PR trying to fix the issue; it has drawn some concerns.
So I think the NodeAffinity issue might still exist.
tfmenard@ do you have more details on why the NodeAffinity issue occurs due to a non-graceful termination of pods when a preemption happens? Thanks!
ma...@gmail.com <ma...@gmail.com> #8
al...@google.com <al...@google.com>
ch...@doit.com <ch...@doit.com> #9
al...@google.com <al...@google.com> #10
After investigation it was determined that
se...@google.com <se...@google.com> #11
A node pool on version 1.22.8-gke.202 is still seeing Pod errors: NodeAffinity ("Pod was terminated in response to imminent node shutdown.").
Can we confirm whether the issue has started happening again?
an...@google.com <an...@google.com> #12
Node pool version 1.23.5-gke.1503 is also facing the same issue. The customer sees a "Pod Predicate NodeAffinity failed" error in one of the pods. The customer's workload is not affected, as a new pod with the same image managed to start on the same node. The problematic pod is backed by a ReplicaSet.
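For anyone triaging similar reports, one way to list pods that failed for this specific reason (a sketch, assuming jq is available; status.reason is the field the kubelet sets on such pods):
kubectl get pods --all-namespaces -o json | jq -r '.items[] | select(.status.reason=="NodeAffinity") | .metadata.namespace + "/" + .metadata.name'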
[Deleted User] <[Deleted User]> #13
mi...@rewe-digital.com <mi...@rewe-digital.com> #14
kr...@quanterall.com <kr...@quanterall.com> #15
um...@thecloudside.com <um...@thecloudside.com> #16
je...@hivebrite.com <je...@hivebrite.com> #17
st...@doit.com <st...@doit.com> #18
Seems to suffer from the same issue on 1.22.12-gke.2300
(both master & nodes).
[Deleted User] <[Deleted User]> #19
ha...@nais.io <ha...@nais.io> #20
la...@gmail.com <la...@gmail.com> #21
be...@google.com <be...@google.com> #22 Restricted+
di...@aalyria.com <di...@aalyria.com> #23
ra...@google.com <ra...@google.com> #24
ji...@google.com <ji...@google.com> #25
Is this issue being actively worked on by any team? It looks like there is a timing issue (race) where node labels are not yet updated when pods are being scheduled?
/cc @msau @liggitt
li...@google.com <li...@google.com> #26
IIRC, this was reported in
That PR ensured nodes have a current view of their API labels before they take action on API sourced pods.
It doesn't look like this issue is on any team's radar. Is there a reproducer?
/cc @acondor for scheduler visibility /cc @porterdavid for node visibility
li...@google.com <li...@google.com> #27
Along with any reproducer, it would help to have a trace of the following:
- Output from a watch of the Node object (to show any label changes), e.g.
kubectl get node/$name --watch -o json --output-watch-events > nodes.json
- Output from a watch of the Pod objects (to show any pod creations / schedulings), e.g.
kubectl get pods --watch -o json --output-watch-events -n $namespace > pods.json
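To spot label changes quickly in the captured node stream afterwards, something like this works (a sketch, assuming jq is available; with --output-watch-events each captured event is a JSON object with type and object fields):
jq -r 'select(.type=="MODIFIED") | .object.metadata.labels' nodes.json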
dh...@google.com <dh...@google.com> #28
ma...@liveramp.com <ma...@liveramp.com> #29
ma...@liveramp.com <ma...@liveramp.com> #30 Restricted
qi...@google.com <qi...@google.com> #31
Adding another repro sample: we consistently encountered this on 1.24.11-gke.1000. We can post the relevant outputs if there is an appropriate thread to share them in.
po...@google.com <po...@google.com> #32
Providing a consistent repro would be very useful! Thank you!
ch...@google.com <ch...@google.com>
gi...@stiga.com <gi...@stiga.com> #33
On the same day, in the same cluster, I also had some internal DNS problems (getaddrinfo EAI_AGAIN).
Description
The original issue in https://issuetracker.google.com/181689705 was fixed as described in https://issuetracker.google.com/181689705#comment5. But there are other conditions that were not handled by the original fix, so pods still fail with the NodeAffinity error, mainly on preemptible nodes.
The latest issue is https://github.com/kubernetes/kubernetes/issues/100467.
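For reproduction purposes, a preemption of a specific preemptible node VM can be triggered on demand (a sketch; INSTANCE_NAME and ZONE are placeholders for the node's underlying Compute Engine instance and its zone):
gcloud compute instances simulate-maintenance-event INSTANCE_NAME --zone=ZONE
On preemptible instances this results in the VM being preempted, which exercises the shutdown path described above.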