Highly Available metrics-server [393364550]

Assigned

Feature Request

Status Update

No update yet.

Description

se...@google.com created issue #1 Jan 30, 2025 05:00PM

This is a feature request to make the metrics-server component of GKE highly available.

Currently, metrics-server runs as a Deployment with a single replica.

Problem
The metrics-server is registered as the backend for the metrics.k8s.io/v1beta1 APIService.

When the metrics-server Pod is unhealthy, disrupted, or rescheduled, it can no longer serve API requests for the metrics endpoint.

This disrupts first-party and third-party Kubernetes controllers, especially those that use API Discovery to determine which API groups and resources are available.

Two notable error examples:

* The Kubernetes namespace garbage collector fails to fully clean up Namespaces when the metrics-server is unavailable.
* Config Sync fails to sync and/or update resource status when the metrics-server is unavailable.

Possible Solution
The metrics-server Deployment used by GKE does not specify a replica count, which makes it default to 1 replica.

The official component GitHub repo has example YAML for deploying a highly available metrics-server:

https://github.com/kubernetes-sigs/metrics-server/?tab=readme-ov-file#high-availability

In addition to the recommended configuration, it would also be a good idea to define a PodDisruptionBudget so that both Pods are not disrupted at the same time, and to set the kube-apiserver flag --enable-aggregator-routing to share the traffic load between the instances.
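The configuration described above could look roughly like the following sketch, which builds a Deployment patch and a PodDisruptionBudget as plain Python dictionaries. The `k8s-app: metrics-server` label, the `kube-system` namespace, and the object name are assumptions taken from the upstream manifests, not confirmed values for the GKE-managed Deployment; the helper function names are hypothetical.

```python
import json

# Assumptions: GKE's metrics-server Pods may use a different label or namespace.
NAMESPACE = "kube-system"
POD_LABELS = {"k8s-app": "metrics-server"}


def ha_deployment_patch(replicas: int = 2) -> dict:
    """Patch body that sets an explicit replica count and spreads Pods across nodes."""
    return {
        "spec": {
            "replicas": replicas,
            "template": {
                "spec": {
                    "affinity": {
                        "podAntiAffinity": {
                            # Hard rule: never place two replicas on the same node.
                            "requiredDuringSchedulingIgnoredDuringExecution": [
                                {
                                    "labelSelector": {"matchLabels": dict(POD_LABELS)},
                                    "topologyKey": "kubernetes.io/hostname",
                                }
                            ]
                        }
                    }
                }
            },
        }
    }


def pod_disruption_budget(min_available: int = 1) -> dict:
    """PodDisruptionBudget that keeps at least `min_available` Pods running
    during voluntary disruptions such as node drains."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": "metrics-server", "namespace": NAMESPACE},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": dict(POD_LABELS)},
        },
    }


if __name__ == "__main__":
    print(json.dumps(ha_deployment_patch(), indent=2))
    print(json.dumps(pod_disruption_budget(), indent=2))
```

With two replicas and `minAvailable: 1`, a node drain can evict one Pod at a time while the other keeps serving the metrics API. The hard anti-affinity rule is also what makes the second replica unschedulable on a single-node cluster.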

Cost & Node Requirement Concerns
Making metrics-server HA requires clusters to have at least 2 nodes, so some tweaks may be required for this solution to work on single-node and zero-node clusters, such as Autopilot.

One way to handle this might be to create a simple controller that modifies the metrics-server Deployment config depending on how many Nodes are in the cluster at any given time. This way the config could be changed to be single-replica on one-node clusters, or even scale to zero on zero-node clusters without causing constant errors about the Deployment not having any healthy replicas.
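The sizing decision such a controller would make can be sketched as a small pure function. This is only an illustration of the idea above: the function names, the cap of 2 replicas, and the patch shape are assumptions, and the code that watches Nodes and applies the patch is left out.

```python
def desired_replicas(node_count: int, max_replicas: int = 2) -> int:
    """Return how many metrics-server replicas to run for a given node count.

    Zero nodes -> 0 replicas, one node -> 1 replica, two or more -> max_replicas.
    """
    if node_count < 0:
        raise ValueError("node_count must not be negative")
    if max_replicas < 1:
        raise ValueError("max_replicas must be at least 1")
    return min(node_count, max_replicas)


def replicas_patch(node_count: int) -> dict:
    """Build a Deployment patch body carrying the computed replica count."""
    return {"spec": {"replicas": desired_replicas(node_count)}}


if __name__ == "__main__":
    for nodes in (0, 1, 2, 5):
        print(nodes, "node(s) ->", desired_replicas(nodes), "replica(s)")
```

Keeping the decision separate from the API calls makes it easy to test, and the cap stops the replica count from growing with cluster size.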

Comments

ka...@google.com <ka...@google.com> Jan 31, 2025 04:46AM

Assigned to ma...@google.com.

ma...@google.com <ma...@google.com> #2 Jan 31, 2025 10:23AM

Reassigned to gc...@google.com.

This was partially addressed by https://android-review.googlesource.com/c/platform/frameworks/support/+/2076902, available now in 1.1 RC02; see b/230665435. As compilation on user builds requires a full target reinstall (which is a behavior change), we offer an opt-out, which can be used to accomplish this feature request:

1) Configure every macrobench to use `CompilationMode.Full()`.
2) Manually issue the compile command `cmd package compile -f -m speed <packagename>` for your target.
3) Pass the instrumentation arg `androidx.benchmark.compilation.enable` = `false` to skip compilation/reinstall for each macrobenchmark.

This should still give you the numbers you've been seeing, while avoiding the cost of a large AOT each test. Leaving this bug open, since in general we should be able to do this more automatically for everything without warmup driven profiles. (Somewhat related bug: there have been excess compilations issued specifically for `Compilation.None`, `StartupMode.COLD` benchmarks, which has been fixed but not shipped publicly yet: b/231976084.)
