Status Update
Comments
cj...@google.com <cj...@google.com> #3
Hello,
This issue report has been forwarded to the Cloud Dataproc Product team so that they may investigate it, but there is currently no ETA for a resolution. Future updates regarding this issue will be provided here.
cj...@google.com <cj...@google.com> #5
Thank you Sushma!
cj...@google.com <cj...@google.com> #8
Hello Prakash and Yaswanth,
I spoke with one of our engineers, and they suggested that the service account (SA) multi-tenancy feature [1] might provide a partial solution to your problem, or at least some building blocks that get you close to one, while I continue to discuss the issue with engineering.
Here is a response from product engineering:
Is the idea that you want to run a job inside the cluster that uses a different SA than the VM's default SA for connections to GCS, BigQuery, etc.? We have a feature that can do this [1].
The awkward part is that you need to know upfront all of the users and all of the service accounts they map to so that you can declare configuration during cluster creation. You can't add, remove or modify this mapping on a running cluster (though we have work in progress to improve this).
It can't be group-based. You really need to know all of the users and SAs at cluster creation, and you can't make changes except by deleting and recreating the cluster.
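For concreteness, here is a minimal sketch of what declaring that mapping at cluster-creation time can look like with the google-cloud-dataproc Python client. The project, region, cluster name, and user/SA pairs are placeholders, and the exact IdentityConfig shape should be verified against the current API reference.

```python
# Sketch: declare the user -> service account mapping up front, at
# cluster creation. The mapping cannot be changed on a running cluster.
from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder
region = "us-central1"      # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "multi-tenant-cluster",
    "config": {
        # Other required cluster settings (machine types, zone, etc.)
        # are omitted from this sketch.
        "security_config": {
            "identity_config": {
                # Every user and their SA must be known here, up front.
                "user_service_account_mapping": {
                    "alice@example.com": "sa-alice@my-project.iam.gserviceaccount.com",
                    "bob@example.com": "sa-bob@my-project.iam.gserviceaccount.com",
                }
            }
        },
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # block until cluster creation finishes
```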
The other possibility: I know of at least one customer that ships exported SA JSON key files into their cluster and then configures the GCS connector in their jobs to use the JSON key file instead of the VM credentials (a sketch follows below). The drawbacks are the extra configuration overhead and the fact that exported key files are not considered a security best practice; it is easy to leak them into other systems with insufficient protections.
This sounds like an anti-pattern that your security team would object to.
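To make that trade-off concrete, here is a hedged PySpark sketch of the key-file configuration described above. The property names are the GCS connector's service-account-keyfile auth settings as I understand them; the key path and bucket are hypothetical, and the security caveat above still applies.

```python
# Sketch: point the GCS connector at an exported service account JSON
# key instead of the VM's default credentials.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("keyfile-auth-example")
    .config("spark.hadoop.fs.gs.auth.type", "SERVICE_ACCOUNT_JSON_KEYFILE")
    .config("spark.hadoop.fs.gs.auth.service.account.json.keyfile",
            "/secrets/team-sa-key.json")  # hypothetical key path
    .getOrCreate()
)

# Reads now authenticate as the key's service account, not the VM SA.
df = spark.read.text("gs://example-bucket/input/*.txt")
print(df.count())
```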
[1]
pr...@verizon.com <pr...@verizon.com> #9
- 500 data scientists
- a lot of groups (100+)
- cannot create 100 service accounts

When I create a cluster, it currently uses the project's default service account or the service account specified when the cluster is created. Instead, it should use my own credentials for interacting with GCS or BigQuery.
Description
Please add a test to exercise this use case:
- 500 data scientists
- a lot of groups (100+)
- cannot create 100 service accounts

When I create a cluster, it currently uses the project's default service account or the service account specified when the cluster is created. Instead, it should use my own credentials for interacting with GCS or BigQuery.
The way it was working 1.5+ years ago, in the middle of 2023:
- grant the service account access to the GCS bucket
- when reads happen, the read should be executed as my user, not as the service account
- authorization should be granted via groups

When I create a cluster, I should be able to access each downstream service using my own principal rather than granting the permissions to the service account.
For an IC cluster, only I will have access, and access will only come from my user; there is no sharing in an IC cluster.
For a general-purpose (non-IC) cluster, access is determined at the time of each request (to GCS, BigQuery, or any other service). Service requests are issued as the user who launched the job.
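As an illustration of the requested semantics (not an existing Dataproc feature), here is a minimal Python sketch assuming the requesting user's own Application Default Credentials (e.g. from `gcloud auth application-default login`) were available where the job runs. The GCS read is then authorized against the user's IAM bindings, which can be granted via groups, rather than against any cluster service account. The bucket and object names are placeholders.

```python
# Sketch: perform a GCS read as the calling user, not a service account,
# assuming the user's own credentials are available via ADC.
import google.auth
from google.cloud import storage

credentials, project = google.auth.default()
client = storage.Client(project=project, credentials=credentials)

# This request is authorized against the *user's* (possibly
# group-granted) permissions on the bucket.
blob = client.bucket("example-bucket").blob("data/part-0000.csv")
data = blob.download_as_bytes()
print(len(data))
```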
=== For internal use only ===
go/vgo/55137450 # SME Consult
go/vgo/55085759 # Vector Case
go/vgo/57661976 # SME Consult
go/vgo/57500204 # Vector Case