Description
Problem you have encountered:
Dataproc Serverless Spark jobs launched at effectively the same time (two jobs submitted back to back, milliseconds apart) are assigned the same application ID.
What you expected to happen:
Each job should receive a unique application ID regardless of launch time or the cluster it runs on, or there should be a way to customize or deduplicate application IDs.
Steps to reproduce:
Create a Persistent History Server (PHS) event log location in GCS, e.g. "gs://dataproc-phs-bucket/phs/event/spark-job-history", then launch multiple Dataproc Serverless Spark jobs within milliseconds of each other. Jobs launched at the same time end up with duplicate application IDs.
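A minimal sketch of the race using the google-cloud-dataproc Python client; the project ID, region, bucket path, and example jar are placeholders, and PHS wiring details (e.g. history server peripherals config) are omitted. Two batches are created back to back so their submission times differ only by milliseconds:

```python
from google.cloud import dataproc_v1

REGION = "us-central1"  # assumed region
PARENT = f"projects/my-project/locations/{REGION}"  # hypothetical project

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

def submit(batch_id: str):
    # Both batches write event logs to the same shared PHS directory.
    batch = dataproc_v1.Batch(
        spark_batch=dataproc_v1.SparkBatch(
            main_class="org.apache.spark.examples.SparkPi",
            jar_file_uris=["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        ),
        runtime_config=dataproc_v1.RuntimeConfig(
            properties={
                "spark.eventLog.enabled": "true",
                "spark.eventLog.dir": "gs://dataproc-phs-bucket/phs/event/spark-job-history",
            }
        ),
    )
    # create_batch returns a long-running operation immediately, so the
    # second submission lands milliseconds after the first.
    return client.create_batch(parent=PARENT, batch=batch, batch_id=batch_id)

ops = [submit("repro-job-1"), submit("repro-job-2")]
for op in ops:
    op.result()  # both applications may end up reporting the same ID
```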
Workarounds:
Add a short delay between job launches.
Use a glob in the GCS path, e.g. gs://dataproc-phs-bucket/*/spark-job-history, so that each job can log to its own subdirectory (see the sketch below).
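A sketch of the glob workaround under the same assumptions as above: give each batch its own event-log subdirectory keyed by its batch ID, and point the PHS at the parent path with a glob (e.g. spark.history.fs.logDirectory=gs://dataproc-phs-bucket/phs/*/spark-job-history) so simultaneous jobs no longer share one log directory:

```python
from google.cloud import dataproc_v1

def submit_isolated(client: dataproc_v1.BatchControllerClient,
                    parent: str, batch_id: str):
    # Each batch writes to its own subdirectory, so two jobs submitted
    # milliseconds apart cannot collide in a single shared directory.
    event_log_dir = f"gs://dataproc-phs-bucket/phs/{batch_id}/spark-job-history"
    batch = dataproc_v1.Batch(
        spark_batch=dataproc_v1.SparkBatch(
            main_class="org.apache.spark.examples.SparkPi",
            jar_file_uris=["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        ),
        runtime_config=dataproc_v1.RuntimeConfig(
            properties={
                "spark.eventLog.enabled": "true",
                "spark.eventLog.dir": event_log_dir,
            }
        ),
    )
    return client.create_batch(parent=parent, batch=batch, batch_id=batch_id)
```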