Status: Assigned
Comments
at...@google.com <at...@google.com> #2
+1
ba...@google.com <ba...@google.com> #3
+1
to...@beatdapp.com <to...@beatdapp.com> #4
This affected our team as well. We had to implement a custom solution to get around this (a simple script that ran batch load jobs in 10k chunks). Not a big deal, but still a bit of a time waste and a frustrating limitation to run into.
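For anyone hitting the same wall, a minimal sketch of that kind of chunked-load workaround (the bucket, prefix, table name, and file format below are illustrative, not our actual setup):

```python
from google.cloud import bigquery, storage

# Hypothetical names -- replace with your own bucket, prefix, and table.
BUCKET = "example_bucket"
PREFIX = "example_folder/"
TABLE_ID = "my_project.my_dataset.my_table"
CHUNK_SIZE = 10_000  # stay under the per-job / per-transfer file-count limit

bq_client = bigquery.Client()
gcs_client = storage.Client()

# Collect the gs:// URIs of every file under the prefix.
uris = [
    f"gs://{BUCKET}/{blob.name}"
    for blob in gcs_client.list_blobs(BUCKET, prefix=PREFIX)
]

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,  # adjust to your file format
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Submit one load job per 10k-file chunk.
for start in range(0, len(uris), CHUNK_SIZE):
    chunk = uris[start:start + CHUNK_SIZE]
    job = bq_client.load_table_from_uri(chunk, TABLE_ID, job_config=job_config)
    job.result()  # wait for this chunk before starting the next one
```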
ng...@thedatasherpas.com <ng...@thedatasherpas.com> #5
ad...@strise.ai <ad...@strise.ai> #6
+1 on that; this should be prioritized.
vi...@backmarket.com <vi...@backmarket.com> #7
+1 on this ticket. It's a really frustrating limitation, especially because it's not described in the Limitations section at https://cloud.google.com/bigquery/docs/cloud-storage-transfer .
bl...@gmail.com <bl...@gmail.com> #8
+1, very frustrating.
I tried to split the data transfer into two via code, say one for previous years and one for the future, but wildcards/regexes with numeric ranges are not supported, so I can't specify anything like the following: s3://example_bucket/example_folder/example_file202[1-2]*
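Splitting into one transfer config per explicit year prefix does at least avoid the unsupported range syntax. A rough sketch with the Python Data Transfer client follows; the project, dataset, table, and especially the Amazon S3 parameter keys are assumptions to verify against the transfer documentation:

```python
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()
parent = client.common_project_path("my_project")  # hypothetical project ID

# One transfer config per explicit prefix, since numeric-range wildcards
# such as example_file202[1-2]* are rejected.
prefixes = {
    "s3 load 2021": "s3://example_bucket/example_folder/example_file2021*",
    "s3 load 2022": "s3://example_bucket/example_folder/example_file2022*",
}

for display_name, data_path in prefixes.items():
    config = bigquery_datatransfer.TransferConfig(
        destination_dataset_id="my_dataset",  # hypothetical dataset
        display_name=display_name,
        data_source_id="amazon_s3",
        params={
            # Parameter keys are assumptions; check the Amazon S3
            # transfer docs for the exact names your source expects.
            "data_path": data_path,
            "destination_table_name_template": "my_table",
            "file_format": "CSV",
            "access_key_id": "...",        # placeholder credentials
            "secret_access_key": "...",
        },
        schedule="every 1 hours",
    )
    created = client.create_transfer_config(
        parent=parent, transfer_config=config
    )
    print(f"Created {created.name} for {data_path}")
```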
je...@lapaireglasses.com <je...@lapaireglasses.com> #9
+1 from our side too.
Description
What you would like to accomplish:
When performing a data transfer from GCS to BigQuery through the BigQuery Data Transfer Service, if a transfer includes more than 10,000 files, we get the following error:
"Transfer Run limits exceeded. Max size: 15.00 TB. Max file count: 10000. Found: size = 3733587305 B (0.00 TB) ; file count = 113222."
We would like a feature where, when more than 10,000 files are detected, BigQuery automatically splits them into multiple load jobs.
If applicable, reasons why alternative solutions are not sufficient:
Our use case produces more than 10,000 files per hour, and hourly is the shortest interval at which a GCS-to-BigQuery transfer can run.
Other solutions, such as Dataflow, would require coding, but we prefer using a fully managed service.