Obsolete
Status Update
Comments
al...@google.com <al...@google.com> #2
Hello,
I understand your issue is that uploading files to GCS buckets and loading tables into BigQuery with the Python libraries is noticeably slower than doing the same with the gsutil and bq commands. Please let me know if I have misunderstood.
Let me clarify that the comparison should be done this way: Python's blob.upload_from_file() vs the gsutil command, and Python's load_table_from_file() vs the bq command. Once that is clear, I would like to ask you for the code you use so I can reproduce the situation myself and get further insights. Please remove all personal information from your code before sharing it.
I will wait for your response,
Manuel Alaman
Google Cloud Big Data Support Barcelona
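To make the suggested pairing concrete, here is a minimal sketch of the two Python-side calls being compared against gsutil and bq. The bucket, table, and file names are placeholders, and the google.cloud imports are deferred so the timing helper can be used on its own; a real run needs credentials and existing resources.

```python
import time

def mbps(num_bytes, seconds):
    """Throughput in megabytes per second (1 MB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024) / seconds

def upload_to_gcs(bucket_name, blob_name, path):
    """Python-side counterpart of `gsutil cp`: Blob.upload_from_file()."""
    from google.cloud import storage  # deferred: needs credentials to actually run
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    start = time.monotonic()
    with open(path, "rb") as fh:
        blob.upload_from_file(fh)
    return time.monotonic() - start

def load_to_bigquery(table_id, path):
    """Python-side counterpart of `bq load`: Client.load_table_from_file()."""
    from google.cloud import bigquery  # deferred: needs credentials to actually run
    client = bigquery.Client()
    start = time.monotonic()
    with open(path, "rb") as fh:
        job = client.load_table_from_file(fh, table_id)
    job.result()  # block until the load job finishes
    return time.monotonic() - start
```

With the elapsed seconds returned by either helper, mbps(file_size, elapsed) gives a number directly comparable to the throughput gsutil and bq report.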
ke...@timosstudios.com <ke...@timosstudios.com> #3
You are correct.
Attached python script will generate a test csv file and conduct the python client test. Please find and replace all occurrences of `UPDATE_THIS` text.
It also has the DDL query you'll need to use to create the BQ table before you run the script.
Additionally, it has the exact bq command you'll need to test the bq CLI utility against the same file.
I just tested again after creating this using Python 3.6.9, google-cloud-bigquery 2.20.0, and BigQuery CLI 2.0.69 (the most recent versions). I still see the same performance difference (~4 MBps upload from the Python client vs ~70 MBps for the same file to the same table using the BigQuery CLI).
Let me know if you need anything else.
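The attached script itself is not reproduced in this thread. As a stand-in, here is a stdlib-only sketch of how a test CSV like the one described might be generated; the column count and random-string layout are assumptions, since the actual DDL and generator are only in the attachment.

```python
import csv
import random
import string

def make_test_csv(path, n_rows, n_cols=5):
    """Write a CSV of random string data, roughly mimicking a benchmark test file."""
    random.seed(0)  # deterministic output so benchmark runs are repeatable
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow([f"col_{i}" for i in range(n_cols)])  # header row
        for _ in range(n_rows):
            writer.writerow(
                ["".join(random.choices(string.ascii_lowercase, k=16))
                 for _ in range(n_cols)]
            )
    return path

# Small example; the issue reproduces with files in the 1-4 GB range,
# which would need n_rows on the order of tens of millions.
make_test_csv("test_upload.csv", n_rows=1000)
```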
ke...@timosstudios.com <ke...@timosstudios.com> #4
Hey there, any update on this?
sa...@google.com <sa...@google.com> #5
Hi Kevin,
We are still investigating the issue. At this point we obtained [1] for the script and [2] for the bq command, where the “Upload complete” was achieved in about 11 seconds.
Further updates will be published here.
[1]
2021-06-30 06:55:01,496 root test_uploads INFO: Beginning load job...
2021-06-30 06:57:08,662 root test_uploads INFO: Job ID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
2021-06-30 06:57:08,662 root test_uploads INFO: BQ load job complete without error!
[2]
Upload complete.
Waiting on bqjob_XXXXXXXXXXXXXXXXX_XXXXXXXXXXXXXXXX_X ... (48s) Current status: DONE
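For reference, the two log timestamps in [1] imply the Python client's load job took just over two minutes end to end, which can be checked with a quick stdlib calculation:

```python
from datetime import datetime

# Timestamp format used by the test script's log lines in [1]
FMT = "%Y-%m-%d %H:%M:%S,%f"
start = datetime.strptime("2021-06-30 06:55:01,496", FMT)
end = datetime.strptime("2021-06-30 06:57:08,662", FMT)
elapsed = (end - start).total_seconds()
print(f"Python client load job: {elapsed:.1f}s")  # vs the 48s bq job wait shown in [2]
```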
ke...@timosstudios.com <ke...@timosstudios.com> #6
Hi there, has there been any progress on this? Should I move this over to an Issue at https://github.com/googleapis/google-cloud-python ?
re...@google.com <re...@google.com>
jo...@google.com <jo...@google.com> #8
Since this is being investigated on GitHub, keeping this issue open here as well seems like a duplicate. Let's close this one and follow the fix on GitHub.
Description
Problem you have encountered:
Using the Python google.cloud.bigquery.client.Client load_table_from_file() method, or first uploading files to my storage bucket with google.cloud.storage.blob.Blob upload_from_file() (to then load into BQ), both result in awfully slow uploads (1-4 MBps), while gsutil and Dropbox reach 10x those speeds on the same machine/environment.
What you expected to happen:
I expected the python clients to be able to upload files to BQ/Storage using the full bandwidth available to my machine.
Steps to reproduce:
Attempt to upload a large (1-4 GB) CSV file to an existing BQ table using the Python bigquery client's load_table_from_file() method.
The same limited speed can be observed when using the Python storage client's blob.upload_from_file().
Other information (workarounds you have tried, documentation consulted, etc):
I am using python 3.6.9. All metrics are tested using a single file upload. I have tried:
Running in a docker container on a Google Compute Engine Ubuntu VM.
Running in a docker container on my mac.
Running on my Mac using just python (no docker).
Uploading the whole file from memory, from disk, uncompressed, and gzipped. No difference.
Using older and the most recent python client library versions.
For older (1.24.0 Bigquery, 1.25.0 Storage) clients I see 1-3MB per second upload speeds. For the 2.13.1 Bigquery client I see 3-4MB per second.
All these tests resulted in identically slow performance.
I have 900+ Mbps up/down on my Mac. The Dropbox python client library running on the same setups easily smokes these speeds using the exact same files, and gsutil on my Mac also shows 10x+ speeds for the same file.
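One knob sometimes suggested in similar throughput reports is the resumable-upload chunk size, which the storage client lets you set explicitly via the chunk_size parameter on Blob (it must be a multiple of 256 KB). This is a hedged sketch only; whether it helps with this particular slowdown is untested here, and the bucket/blob names are placeholders.

```python
CHUNK_SIZE = 100 * 1024 * 1024  # 100 MB; GCS requires a multiple of 256 KB

def upload_with_large_chunks(bucket_name, blob_name, path, chunk_size=CHUNK_SIZE):
    """Upload using an explicit resumable-upload chunk size instead of the default."""
    from google.cloud import storage  # deferred import: a real upload needs credentials
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name, chunk_size=chunk_size)
    with open(path, "rb") as fh:
        blob.upload_from_file(fh)
```

Larger chunks mean fewer round trips per resumable upload session, at the cost of more memory per in-flight chunk.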