Assigned
Comments
je...@google.com <je...@google.com>
je...@google.com <je...@google.com> #2
Apologies, I intended to create a feature request not a bug. I will create a feature request instead.
je...@google.com <je...@google.com> #3
Hello,
This issue report has been forwarded to the Cloud Dataproc Product team so that they may investigate it, but there is no ETA for a resolution today. Future updates regarding this issue will be provided here.
Description
Hopefully this is not a duplicate as I could not find any related issues.
After updating Dataproc from the Debian 2.1 image to 2.2, I noticed an explosion in the "Time series ingestion requests per minute" metric.
Digging into the cause, I finally found that the culprit was the "npd" process running on the Dataproc master VM. Using
> sudo journalctl -u npd.service -f
I can see a bunch of errors such as:
> npd[5946]: 2024/04/17 13:48:34 Failed to export to Stackdriver: rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Field timeSeries[48].metric.type had an invalid value: The metric type must be a URL-formatted string with a domain and non-empty path.; Field timeSeries[199].metric.type had an invalid value: The metric type must be a URL-formatted string with a domain and non-empty path.; Field timeSeries[84].metric.type had an invalid value: The metric type must be a URL-formatted string with a domain and non-empty path.; Field timeSeries[120].metric.type had an invalid value
Or
> npd[5946]: Failed to log metrics to API: rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older end time than the most recent point.: cloud_dataproc_job{job_uuid:,job_id:attempt_timestamp_1713360606123,region:europe-west1} timeSeries[0]:
After a while it starts to show as:
> Data points cannot be written more than approximately 24 hours in the past, specifically no more than 24h in the past.
It looks as if it never gives up retrying the request, *even* when the request itself is invalid.
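To get a feel for which of these errors dominates, a rough sketch like the following can tally them from the journal. The function name `summarize_npd_errors` and the category labels are made up for illustration, and the patterns only match the three messages quoted above.

```shell
# Hypothetical helper: reads npd log lines on stdin and prints a count per error kind.
# The patterns are taken from the messages quoted in this report, nothing more.
summarize_npd_errors() {
  awk '
    /must be a URL-formatted string/  { n["invalid_metric_type"]++ }
    /Points must be written in order/ { n["out_of_order_points"]++ }
    /in the past/                     { n["too_old_points"]++ }
    END { for (k in n) print k, n[k] }
  ' | sort
}

# Example (on the master VM):
#   sudo journalctl -u npd.service --since "1 hour ago" --no-pager | summarize_npd_errors
```

Each log line is counted once per category, even when it lists several rejected timeSeries entries.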
Given the error I suspect the problem occurred with the introduction of YARN metrics in October:
The only workaround I could find is to replace the contents of these files with an empty {}:
> /usr/local/share/google/dataproc/npd-config/{yarn-nm-monitor.json,yarn-rm-monitor.json}
And then restart the NPD service:
> sudo systemctl restart npd.service
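For what it's worth, the two steps above can be scripted roughly like this. It is only a sketch: the function name `disable_yarn_monitors` and the `.bak` backup are my own additions, and the directory is passed in as an argument so the file edits can be tried on a scratch copy first.

```shell
# Hypothetical helper: overwrite the YARN monitor configs in DIR with an empty {}.
# Usage: disable_yarn_monitors DIR
disable_yarn_monitors() {
  local dir="$1" name file
  for name in yarn-nm-monitor.json yarn-rm-monitor.json; do
    file="${dir}/${name}"
    # Skip a config file that is not present in DIR.
    [ -f "${file}" ] || continue
    # Keep a one-time backup so the change can be reverted later.
    [ -f "${file}.bak" ] || cp -p "${file}" "${file}.bak"
    printf '{}\n' > "${file}"
  done
}

# Example (as root on the master VM, path as given in this report):
#   disable_yarn_monitors /usr/local/share/google/dataproc/npd-config
#   systemctl restart npd.service
```

Something like this could presumably be run from a cluster initialization action to avoid the manual step on every rolling update, but I have not verified that NPD picks up the change at that point.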
But it's a real pain since I perform rolling updates of Dataproc clusters regularly.
Any pointers on how to solve this issue, or on a better workaround, would be greatly appreciated!