Status Update
Comments
se...@google.com <se...@google.com> #2
[Deleted User] <[Deleted User]> #3
+1
je...@google.com <je...@google.com>
mi...@google.com <mi...@google.com> #4
za...@sadan.me <za...@sadan.me> #5
+1
iv...@gmail.com <iv...@gmail.com> #6
ta...@plaid.co.jp <ta...@plaid.co.jp> #7
ba...@httparchive.org <ba...@httparchive.org> #8
I've noticed one weird thing. With an admin/table owner account I get a more accurate estimate. While any pother account get's the inaccurate one.
For example the following gives me an estimate of 42.29 GB when running as a table owner account:
SELECT DISTINCT
date,
client,
url
FROM
`httparchive.all.requests`,
UNNEST (response_headers) as resp_headers
WHERE
date = "2023-04-01" AND
LOWER(resp_headers.name) = 'vary' AND
LOWER(resp_headers.value) LIKE '%x-wf-forwarded-proto%' AND
is_main_document
(the table is partitioned on date and clustered on is_main_document)
Running as a general account with no particular permissions on this project estimates 3.08 TB. Despite the fact that this account can also see the clustering details.
In both cases 42.29 GB is what is actually used when the run is completed.
Why is the estimate different, and is there any extra permissions we can give to provide access to the more accurate estimate?
ak...@tutelatechnologies.com <ak...@tutelatechnologies.com> #9
Can this be confirmed?
pu...@google.com <pu...@google.com>
bv...@gmail.com <bv...@gmail.com> #10
This one also gives an estimate of 8Tb while actually it costs 420Mb:
WITH t1 AS (
SELECT ANY_VALUE("") AS col1
FROM `httparchive.all.pages`
WHERE date = "2024-04-01" -- partitioning
), t2 AS (
SELECT ANY_VALUE(custom_metrics) --heavy column
FROM `httparchive.all.pages`
WHERE date = "2024-04-01" -- partitioning
AND rank = 1000 -- clustering
)
SELECT *
FROM t1
JOIN t2 ON TRUE
Estimation for both of the CTEs (separately) is done correctly.
Doing UNION ALL also gives the same wrong estimation:
SELECT ANY_VALUE("")
FROM `httparchive.all.pages`
WHERE date = "2024-04-01" -- partitioning
UNION ALL
SELECT ANY_VALUE(custom_metrics) -- heavy column
FROM `httparchive.all.pages`
WHERE date = "2024-04-01" -- partitioning
AND rank = 1000 -- clustering
Description
As per[1]:
In a clustered table, BigQuery automatically sorts the data based on the values in the clustering columns and organizes them in optimally sized storage blocks. You can achieve more finely grained sorting by creating a table that is clustered and partitioned. A clustered table maintains the sort properties in the context of each operation that modifies it. As a result, BigQuery might not be able to accurately estimate the bytes processed by the query or the query costs.
However, in some cases the estimated bytes shown is the size of the full table scan and does not take into account the partition size.
Therefore, assuming this is provided as best effort basis, there still might be some room for improvement on Bigquery dry_run calculations.
[1]https://cloud.google.com/bigquery/docs/clustered-tables#clustering_partitioned_tables