Status: Assigned
Comments
je...@google.com
ja...@google.com #3
Hello,
This issue report has been forwarded to the Cloud Dataproc product team so that they may investigate it; there is currently no ETA for a resolution. Future updates regarding this issue will be provided here.
Description
Problem you have encountered:
Dataproc does not natively support per-job cost reporting. This feature request is filed on behalf of a customer. Here is the customer's business justification, in both technical and financial terms:
I want to mention that, in our eyes, such a feature has high value. We run multiple streams on the same GKE cluster, and we have no visibility into how much each stream costs. Such billing granularity would give us a lot of important details:
Do we have a stream with very high costs? If so, that alerts us to something we need to investigate.
Do we have anomalies in a stream during the day? If so, we could try to determine whether it is a matter of higher traffic, Spark tuning, etc. It would certainly highlight something for us to check.
We are currently comparing BigLake Iceberg unmanaged tables vs. writing to BigQuery, and we lack information about the cost differences.
What you expected to happen: Billing granularity per job, available once each job finishes.
Steps to reproduce: Not reproducible.
Other information (workarounds you have tried, documentation consulted, etc):
We have tried updating the running jobs by adding specific labels [1], but those labels do not appear among the available filters in the billing console.
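For reference, a minimal sketch of that label-based workaround using the google-cloud-dataproc Python client; the project, region, job ID, and the cost-center label below are placeholders, and the underlying limitation remains that job labels are not exposed as billing console filters:

```python
from google.cloud import dataproc_v1

project_id = "my-project"        # placeholder
region = "us-central1"           # placeholder
job_id = "my-streaming-job-id"   # placeholder

# Use the regional Dataproc endpoint for the job's region.
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Fetch the job and add/overwrite a label intended for cost attribution.
job = client.get_job(
    request={"project_id": project_id, "region": region, "job_id": job_id}
)
job.labels["cost-center"] = "stream-orders"  # placeholder key/value

# Push only the labels field back to the job.
updated = client.update_job(
    request={
        "project_id": project_id,
        "region": region,
        "job_id": job_id,
        "job": job,
        "update_mask": {"paths": ["labels"]},
    }
)
print(dict(updated.labels))
```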
The user's use case is streaming, which makes real-time cost analysis complicated given the many moving pieces involved in calculating cost.
For example, if the customer's job only utilizes 50% of the cluster's resources, do we count that job's cost as 50% of (instance charges + Dataproc surcharges) for that period?
Another example: when multiple jobs run in parallel but do not fully overlap, we would need very granular metric points to determine how many resources each job consumed at a given timestamp, which would be complicated and would require significant effort to design and implement.
That said, given the impact described above, this would be a very useful feature to design.
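Purely to illustrate the kind of attribution arithmetic described above, here is a toy sketch with made-up utilization samples and a hypothetical hourly cluster cost; a real implementation would need per-interval metrics (for example from Cloud Monitoring) and actual billing rates:

```python
# Toy example of time-proportional cost attribution across overlapping jobs.
# All numbers are hypothetical.

cluster_cost_per_hour = 10.00  # instance charges + Dataproc surcharge (USD, made up)
interval_hours = 0.25          # 15-minute metric intervals

# Fraction of cluster resources each job consumed in each interval
# (0.0 where the job was not running).
utilization = {
    "job-a": [0.50, 0.50, 0.20, 0.00],
    "job-b": [0.30, 0.40, 0.60, 0.70],
}

per_job_cost = {
    job: sum(frac * cluster_cost_per_hour * interval_hours for frac in samples)
    for job, samples in utilization.items()
}

for job, cost in per_job_cost.items():
    print(f"{job}: ${cost:.2f}")

# The per-job shares do not necessarily add up to the full cluster bill:
# idle capacity still costs money and has to be attributed somehow, which
# is part of the design complexity mentioned above.
```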
Links:
[1] https://cloud.google.com/dataproc/docs/guides/creating-managing-labels#creating_and_using_labels