Comments
va...@google.com <va...@google.com>
ja...@google.com <ja...@google.com> #2
Hello,
Thank you for reaching out to us!
To assist us in conducting a thorough investigation, we kindly request your cooperation in providing the following information about the reported issue:
- Please provide detailed steps to reliably reproduce the problem.
- It would also be very helpful if you could attach screenshots of the output related to this issue.
Your cooperation in providing these details will enable us to dive deeper into the matter and work towards a prompt resolution. We appreciate your assistance and look forward to resolving this issue for you.
Thank you for your understanding and cooperation.
sa...@broadcom.com <sa...@broadcom.com> #3
A data provider team/user is writing data to a GCS location. In my case, the data is written in JSONL format in a partitioned manner.
For example, the JSONL data is laid out in GCS as follows:
gs://my-gcs-bucket/incident-events/event_date=2025-03-20/...
gs://my-gcs-bucket/incident-events/event_date=2025-03-21/...
gs://my-gcs-bucket/incident-events/event_date=2025-03-22/...
gs://my-gcs-bucket/incident-events/event_date=2025-03-23/...
gs://my-gcs-bucket/incident-events/event_date=2025-03-24/...
gs://my-gcs-bucket/incident-events/event_date=2025-03-25/...
The base location is gs://my-gcs-bucket/incident-events/. The partition field is event_date, and new data is written only under the current date's partition.
A separate data reader team/user wants to analyze this data (the historical data plus the new data as it arrives). The data provider and data reader teams are independent: the writer writes on its own schedule (effectively all the time), and readers can query the data at any time.
In my GCS bucket, Object Versioning is turned off.
We have put a BigQuery external table on top of this data, and the data reader team uses that table to query it.
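For reference, the external table sits over the hive-partitioned layout roughly as follows. This is only a minimal sketch based on the paths above, not the exact DDL we used; the identifiers are the same placeholder names used elsewhere in this report, and the data columns are assumed to be auto-detected from the JSONL files:
-- Illustrative external table definition over the hive-partitioned JSONL data.
CREATE EXTERNAL TABLE `my_gcp_project_id.my_dataset_name.my_external_table`
WITH PARTITION COLUMNS (
  event_date DATE  -- derived from the event_date=YYYY-MM-DD folders
)
OPTIONS (
  format = 'NEWLINE_DELIMITED_JSON',
  uris = ['gs://my-gcs-bucket/incident-events/*'],
  hive_partition_uri_prefix = 'gs://my-gcs-bucket/incident-events/'
);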
Issue: When the data reader queries the BigQuery table for the current date, the query sometimes succeeds, but most of the time it fails with an error like the following:
Not found: Files gs://my-gcs-bucket/incident-events/event_date=2025-03-25/psr-34TtwL74Gf-400.json (Version 1742927008537775)
Example query:
-- Assuming the current date is 2025-03-25; replace with the current date.
SELECT count(*) FROM `my_gcp_project_id.my_dataset_name.my_external_table` WHERE event_date = '2025-03-25';
Root Cause: The writer is mutating some objects under the current-date partition while queries are running, so the object version BigQuery expects (as shown in the error message) no longer exists by the time the file is read.
Request: BigQuery should give users an option they can turn on so that files currently undergoing mutation are ignored at query time and results are returned based only on the objects that have already been finalized.
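To illustrate the gap: queries restricted to partitions that are no longer being written to should avoid the in-flight files, but they exclude exactly the current-date data the reader team needs as it arrives. The sketch below assumes, as in our case, that only the current-date partition is ever mutated; the snapshot table name is hypothetical:
-- Read only partitions that the writer has finished with.
SELECT count(*)
FROM `my_gcp_project_id.my_dataset_name.my_external_table`
WHERE event_date < CURRENT_DATE();

-- Or periodically materialize the closed partitions into a native table for readers.
CREATE OR REPLACE TABLE `my_gcp_project_id.my_dataset_name.my_events_snapshot`
PARTITION BY event_date AS
SELECT *
FROM `my_gcp_project_id.my_dataset_name.my_external_table`
WHERE event_date < CURRENT_DATE();
Neither variant covers the current date's data as it is being written, which is the use case the requested option would address.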
ja...@google.com <ja...@google.com> #4
Hello,
Thank you for reaching out to us with your request.
We have duly noted your feedback and will validate it thoroughly. While we cannot provide an estimated time of implementation or guarantee that the request will be fulfilled, please be assured that your input is highly valued. Your feedback enables us to enhance our products and services.
We appreciate your continued trust and support in improving our Google Cloud Platform products. In case you want to report a new issue, please do not hesitate to create a new issue on the Issue Tracker.
Once again, we sincerely appreciate your valuable feedback. Thank you for your understanding and collaboration.