Bug P2
Comments
tg...@nvidia.com <tg...@nvidia.com> #2
Hi, could you please clarify the issue description or share a screenshot of the problem you are facing?
im...@google.com <im...@google.com> #3
Hi, it's a feature request. Can you change it to a feature request?
I am requesting a Dataproc image version that supports Spark 3.4.
https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.2
im...@google.com <im...@google.com> #4
Hello,
Thank you for reaching out to us with your request.
We have duly noted your feedback and will thoroughly validate it. While we cannot provide an estimated time of implementation or guarantee the fulfillment of the feature request, please be assured that your input is highly valued. Your feedback enables us to enhance our products and services.
sr...@nvidia.com <sr...@nvidia.com> #5
Is it possible to get early access to the fix, so we or the end customer can test it?
im...@google.com <im...@google.com> #6
The updated images are not yet built to be released externally, but you may apply the workaround by setting the following parameters in the startup scripts or initialization actions:
yarn.scheduler.capacity.root.default.user-limit-factor=2
yarn.scheduler.capacity.root.dataproc-driverpool-driver-queue.user-limit-factor=2
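For reference, a minimal initialization-action sketch that applies this workaround could look like the script below. It assumes the bdconfig utility and the dataproc-role metadata key that ship on standard Dataproc images, plus the usual /etc/hadoop/conf path; the ResourceManager service name may vary by image version, so treat this as an untested illustration rather than a supported script:

#!/bin/bash
# Untested sketch: raise the user-limit-factor on the default and
# driver-pool queues so one user's jobs can fill the whole queue.
set -euxo pipefail

ROLE="$(/usr/share/google/get_metadata_value attributes/dataproc-role)"

if [[ "${ROLE}" == "Master" ]]; then
  for queue in root.default root.dataproc-driverpool-driver-queue; do
    bdconfig set_property \
      --configuration_file /etc/hadoop/conf/capacity-scheduler.xml \
      --name "yarn.scheduler.capacity.${queue}.user-limit-factor" \
      --value 2 \
      --clobber
  done
  # Assumed service name; alternatively run: yarn rmadmin -refreshQueues
  systemctl restart hadoop-yarn-resourcemanager
fi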
Description
Problem you have encountered:
In clusters created with driver pools, the default queue's capacity (yarn.scheduler.capacity.root.default.capacity) is set to 50% while its maximum capacity (yarn.scheduler.capacity.root.default.maximum-capacity) is set to 100%. This means that a single user's jobs cannot get more than half of the cluster's resources, as the user limit factor (yarn.scheduler.capacity.root.default.user-limit-factor) is set to 1 by default (see these docs [1]).
What you expected to happen:
Jobs being able to use 100% of cluster resources (regardless of their user)
Steps to reproduce:
Other information (workarounds you have tried, documentation consulted, etc):
Changing the default queue's capacity to 100% or the user limit factor to 2.0 would fix the issue; however, capacity scheduler configurations can't be submitted as properties during cluster creation (they are blacklisted), so this has to be done manually or in startup/initialization scripts.
[1]
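As an illustration of the startup/init-script approach, a hypothetical way to attach such a script at cluster creation time (bucket, script, and cluster names below are placeholders) would be:

gsutil cp set-user-limit-factor.sh gs://my-bucket/init/
gcloud dataproc clusters create my-driver-pool-cluster \
  --region=us-central1 \
  --initialization-actions=gs://my-bucket/init/set-user-limit-factor.sh

Alternatively, the same properties can be edited manually in /etc/hadoop/conf/capacity-scheduler.xml on the master, followed by a queue refresh.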