Description
Problem you have encountered:
Dataproc Serverless Spark jobs launched at the same time (two jobs submitted back to back, only milliseconds apart) are assigned the same application ID.
What you expected to happen:
Each job should receive a unique application ID regardless of submission time or the cluster it is launched on, or there should be a way to customize the application ID.
Steps to reproduce: Create a Persistent History Server (PHS) event log location in GCS, e.g. "gs://dataproc-phs-bucket/phs/event/spark-job-history", then launch multiple Dataproc Serverless Spark jobs within milliseconds of each other. Jobs submitted at the same time end up with duplicate application IDs. A minimal reproduction sketch follows below.
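A minimal reproduction sketch using the google-cloud-dataproc Python client, assuming placeholder project, region, batch IDs, and the example SparkPi jar shipped with Spark; the bucket path is the one from the report above:

```python
# Reproduction sketch: submit two Dataproc Serverless Spark batches back to
# back and then inspect the application IDs written to the shared PHS event
# log location. Project/region/batch IDs are placeholders.
from google.cloud import dataproc_v1

PROJECT = "my-project"   # placeholder
REGION = "us-central1"   # placeholder
PARENT = f"projects/{PROJECT}/locations/{REGION}"

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

def submit(batch_id: str):
    batch = dataproc_v1.Batch(
        spark_batch=dataproc_v1.SparkBatch(
            main_class="org.apache.spark.examples.SparkPi",  # placeholder job
            jar_file_uris=["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        ),
        runtime_config=dataproc_v1.RuntimeConfig(
            properties={
                "spark.eventLog.enabled": "true",
                # Shared PHS event log location from the report above.
                "spark.eventLog.dir": "gs://dataproc-phs-bucket/phs/event/spark-job-history",
            }
        ),
    )
    # create_batch returns a long-running operation; we deliberately do not
    # wait here, so the two submissions land milliseconds apart.
    return client.create_batch(parent=PARENT, batch=batch, batch_id=batch_id)

op_a = submit("dup-appid-test-a")
op_b = submit("dup-appid-test-b")
op_a.result()
op_b.result()
# Then list gs://dataproc-phs-bucket/phs/event/spark-job-history and check
# whether both batches wrote event logs under the same application ID.
```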
Other information (workarounds you have tried, documentation consulted, etc):
Workarounds:
Configuring the PHS log directory with a wildcard so each job can write its event logs to its own subdirectory (see the sketch below):
gs://dataproc-phs-bucket/*/spark-job-history
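A hedged sketch of that workaround, assuming the PHS reads from the wildcard location via spark.history.fs.logDirectory and each batch overrides spark.eventLog.dir to a unique subdirectory; the helper and the suffix scheme are hypothetical, not a confirmed fix:

```python
# Hypothetical workaround sketch: give each batch its own event log
# directory under the bucket. A PHS configured with
#   spark.history.fs.logDirectory=gs://dataproc-phs-bucket/*/spark-job-history
# picks up every per-batch directory, so two jobs that happen to share an
# application ID no longer write into the same directory.
import uuid

def per_batch_properties() -> dict:
    batch_suffix = uuid.uuid4().hex[:8]  # unique per submission
    return {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": (
            f"gs://dataproc-phs-bucket/{batch_suffix}/spark-job-history"
        ),
    }

# Pass per_batch_properties() as RuntimeConfig.properties when submitting
# each batch (see the reproduction sketch above).
```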