BigQuery dry_run estimations significantly off when clustering is applied [176795805]

Assigned

Feature Request

Status Update

No update yet.

Description

se...@google.com

created issue #1

Jan 5, 2021 12:59PM

As per [1]:

In a clustered table, BigQuery automatically sorts the data based on the values in the clustering columns and organizes them in optimally sized storage blocks. You can achieve more finely grained sorting by creating a table that is clustered and partitioned. A clustered table maintains the sort properties in the context of each operation that modifies it. As a result, BigQuery might not be able to accurately estimate the bytes processed by the query or the query costs.

However, in some cases the estimate shown is the size of a full table scan and does not take the partition size into account.

Therefore, even assuming the estimate is provided on a best-effort basis, there still seems to be room for improvement in BigQuery dry_run calculations.

[1] https://cloud.google.com/bigquery/docs/clustered-tables#clustering_partitioned_tables
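
For anyone reproducing this, here is a minimal sketch of comparing the dry-run estimate with the bytes actually billed. It assumes the google-cloud-bigquery Python client is installed and authenticated. The helper names (dry_run_bytes, billed_bytes, overestimate_factor) are illustrative only and not part of any API. Note that billed_bytes really runs the query, so it incurs cost.

```python
def dry_run_bytes(client, sql):
    """Return the dry-run estimate of bytes processed for `sql`."""
    # Imported lazily so overestimate_factor works without the library installed.
    from google.cloud import bigquery

    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)  # a dry run returns immediately
    return job.total_bytes_processed


def billed_bytes(client, sql):
    """Run `sql` for real (this is billed) and return the bytes billed."""
    from google.cloud import bigquery

    config = bigquery.QueryJobConfig(use_query_cache=False)
    job = client.query(sql, job_config=config)
    job.result()  # wait for the query to finish
    return job.total_bytes_billed


def overestimate_factor(estimated_bytes, actual_bytes):
    """How many times larger the estimate is than the actual figure."""
    if actual_bytes <= 0:
        raise ValueError("actual_bytes must be positive")
    return estimated_bytes / actual_bytes
```

Usage would look like `overestimate_factor(dry_run_bytes(client, sql), billed_bytes(client, sql))` with `client = bigquery.Client()`. Billed bytes can be rounded up to a minimum, so small factors are not necessarily meaningful.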

Comments

se...@google.com <se...@google.com> #2 Jan 5, 2021 01:01PM

Agreed, you can download photos via the Google Drive API but this hasn't been incorporated into the Photos API yet...

[Deleted User] <[Deleted User]> #3 Jan 5, 2021 01:29PM

There's already an issue tracking the missing EXIF data here:

https://issuetracker.google.com/111228390

Regarding the file at original quality - this is something that's on our list and will be addressed soon, please stay tuned. I'll update this bug once we have an update to share.

je...@google.com <je...@google.com> Jan 19, 2021 06:46PM

Reassigned to mi...@google.com.

mi...@google.com <mi...@google.com> #4 Apr 28, 2021 09:50PM

Reassigned to ja...@google.com.

We have just released a new version of the Google Photos Library API that now supports this feature.

You can now use the "d" base URL parameter to download the original photo. See the base URL parameter guide for more details:

https://developers.google.com/photos/library/guides/access-media-items#image-base-urls
Thanks for your patience!

See our release notes for further detail:

https://developers.google.com/photos/library/support/release-notes#2018-07-31

za...@sadan.me <za...@sadan.me> #5 Oct 25, 2022 08:18PM

I've been testing downloading images using the "d" parameter, but the files returned are not the original files that were uploaded. The release note above reads "d download parameter, to download the original image", but the linked developer documentation doesn't say that the original image will be downloaded. Can you please confirm whether the "d" parameter should download the original image that was uploaded through the API, or whether this behavior has since changed?

iv...@gmail.com <iv...@gmail.com> #6 Oct 26, 2022 08:28AM

I've tested this again, and the downloaded file is a mutated version of the file uploaded to Photos. If I go to photos.google.com in a web browser and "Download" a photo, it is byte-for-byte identical to the file I uploaded. If I use the API and fetch a copy using the base URL with the "d" parameter, the file is mutated: some metadata is missing, and the photo itself is modified (a pixel-for-pixel comparison shows subtle differences).
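
The byte-for-byte check described above can be scripted. This is a minimal sketch using only the Python standard library. It assumes the uploaded original and the file fetched with the "d" parameter have both been saved locally; the function names and any paths passed in are placeholders.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b):
    """True if both files have identical content (compared by SHA-256)."""
    return sha256_of(path_a) == sha256_of(path_b)
```

A result of False only shows that the bytes differ. It does not say whether the difference is in the metadata or in the pixel data.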

ta...@plaid.co.jp <ta...@plaid.co.jp> #7 Oct 31, 2022 04:00PM

I can confirm this is still broken and using the 'd' parameter does not return the original photo.

ba...@httparchive.org <ba...@httparchive.org> #8 May 1, 2023 05:26PM

This is still an issue

an...@opensignal.com <an...@opensignal.com> #9 Jul 13, 2023 07:18PM

How is it marked as fixed? =d parameter still doesn't work

pu...@google.com <pu...@google.com> Oct 13, 2023 12:35PM

Reassigned to gc...@google.com.

bv...@gmail.com <bv...@gmail.com> #10 May 22, 2024 06:50PM

This query also gets an estimate of 8 TB, while it actually costs 420 MB:

WITH t1 AS (
  SELECT ANY_VALUE("") AS col1
  FROM `httparchive.all.pages`
  WHERE date = "2024-04-01" -- partitioning
), t2 AS (
  SELECT ANY_VALUE(custom_metrics) --heavy column
  FROM `httparchive.all.pages`
  WHERE date = "2024-04-01" -- partitioning
    AND rank = 1000 -- clustering
)
SELECT *
FROM t1
JOIN t2 ON TRUE

Estimation for both of the CTEs (separately) is done correctly.

Doing UNION ALL also gives the same wrong estimation:

SELECT ANY_VALUE("")
  FROM `httparchive.all.pages`
  WHERE date = "2024-04-01" -- partitioning
UNION ALL
  SELECT ANY_VALUE(custom_metrics) -- heavy column
  FROM `httparchive.all.pages`
  WHERE date = "2024-04-01" -- partitioning
    AND rank = 1000 -- clustering
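
A rough way to flag this pattern automatically is to compare the estimate for the combined query against the sum of the estimates for its parts. The sketch below assumes the combined estimate should not greatly exceed that sum, which is a heuristic and not documented BigQuery behavior. The `estimate` argument is any callable that maps a SQL string to a byte count (for example, a wrapper around a dry run). The function names and the default tolerance are hypothetical.

```python
def is_estimate_inflated(combined_bytes, part_bytes, tolerance=1.5):
    """True if the combined estimate exceeds tolerance x the sum of the parts."""
    return combined_bytes > tolerance * sum(part_bytes)


def check_estimate(estimate, combined_sql, part_sqls, tolerance=1.5):
    """Estimate the combined query and each part, then report the comparison.

    `estimate` is a caller-supplied function: SQL string -> estimated bytes.
    """
    combined = estimate(combined_sql)
    parts = [estimate(sql) for sql in part_sqls]
    return {
        "combined": combined,
        "parts_total": sum(parts),
        "inflated": is_estimate_inflated(combined, parts, tolerance),
    }
```

Here `part_sqls` would hold each CTE or UNION ALL branch as a standalone query, and `combined_sql` the full statement.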