Status Update
Comments
va...@google.com <va...@google.com>
ja...@google.com <ja...@google.com> #2
Hello,
Thank you for reaching out to us!
To assist us in conducting a thorough investigation, we kindly request your cooperation in providing the following information regarding the reported issue:
- Do you see this issue constantly or intermittently?
- Is there any specific scenario or time at which this issue is observed?
- To help us understand the issue better, please provide detailed steps to reliably reproduce the problem.
- It would be very helpful if you could attach screenshots of the output/error related to this issue.
Your cooperation in providing these details will enable us to dive deeper into the matter and work towards a prompt resolution. We appreciate your assistance and look forward to resolving this issue for you.
Thank you for your understanding and cooperation.
le...@zeliq.com <le...@zeliq.com> #3
1. This is the first time I have encountered this issue, and it is still present today.
2. I don't know. It seems to be a bug that needs to be fixed, as it won't go away otherwise.
3. Go to BigQuery, create a table, and import the file I shared in my initial post. You will see that a field containing strings is incorrectly assigned the INTEGER type because its values are numeric, even though they are quoted (which indicates a string). See the repro sketch below.
4. I'm sharing a screenshot of a large import that fails because a field is assigned as INTEGER instead of STRING. BigQuery first processes a file in which the field contains only quoted numbers, incorrectly infers INTEGER, and the next file then cannot be written because the schema expects an INTEGER where a STRING should be stored.
The reverse problem can also occur: if you first import a STRING, BigQuery will later reject a file in which the column contains only quoted numbers, because it then decides the field should be an INTEGER.
TL;DR: BigQuery's Schema Auto Detection incorrectly assigns a field as INTEGER when a file contains only quoted numbers or null values (i.e., no value that is unambiguously a string).
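Here is a minimal repro sketch using the google-cloud-bigquery Python client; the project, dataset, and table names are placeholders I made up:

```python
import io
import json

from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials/project
table_id = "my_project.my_dataset.autodetect_repro"  # placeholder table

# Newline-delimited JSON whose only values are quoted numbers or null.
rows = [{"field": "123456"}, {"field": None}, {"field": "789"}]
payload = io.BytesIO("\n".join(json.dumps(r) for r in rows).encode("utf-8"))

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema
)
client.load_table_from_file(payload, table_id, job_config=job_config).result()

# Expected: "field" is STRING (the values are quoted).
# Observed (this report): "field" is detected as INTEGER.
print(client.get_table(table_id).schema)
```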
Thanks for your support.
ja...@google.com <ja...@google.com> #4
Hello,
Thanks for your response.
Even though the numbers are in double quotes, BigQuery might still interpret them as integers if it detects a pattern. If the first file only has quoted numbers and nulls, it might think the column is always going to be an integer.
I would suggest you create a small "dummy" file that contains at least one actual string value in the column. Make sure this file is processed first during the import. This will give BigQuery a better chance of inferring the correct data type.
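For illustration only, here is a rough sketch of that approach with the google-cloud-bigquery Python client; the table name and GCS path below are placeholders, and this is just one possible way to sequence the two loads:

```python
import io

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.imported_data"  # placeholder table
ndjson = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON

# 1) Load a tiny "dummy" file first; the column contains an unambiguous
#    string, so auto-detection should settle on STRING.
seed = io.BytesIO(b'{"field": "seed-value"}\n')
client.load_table_from_file(
    seed,
    table_id,
    job_config=bigquery.LoadJobConfig(source_format=ndjson, autodetect=True),
).result()

# 2) Append the real files afterwards, against the table whose schema
#    already records the column as STRING.
client.load_table_from_uri(
    "gs://my-bucket/data/*.json",  # placeholder GCS path
    table_id,
    job_config=bigquery.LoadJobConfig(
        source_format=ndjson,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    ),
).result()
```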
I am waiting for your response.
le...@zeliq.com <le...@zeliq.com> #5
> I would suggest you create a small "dummy" file that contains at least one actual string value in the column.
No, that won't work, because the issue goes both ways (a sketch of the first direction follows this list):
1. If the first file contains a STRING, the column is set to STRING. But as soon as BigQuery encounters a file where the column contains only `null` values or quoted numbers (e.g., `"1234"`), it considers the field type to have changed.
2. If the first file contains only quoted numbers (which should still be treated as strings), BigQuery wrongly assigns the column type as INTEGER, even though quoted numbers are strings, not integers.
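A sketch of direction 1 with the Python client (direction 2 is the same as the repro sketch in #3); the table name is a placeholder, and every load uses auto-detection only, as in my pipeline:

```python
import io

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.direction_one"  # placeholder table
ndjson = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON

def load(payload: bytes, **config) -> None:
    """Load one newline-delimited JSON payload into the test table."""
    job_config = bigquery.LoadJobConfig(source_format=ndjson, **config)
    client.load_table_from_file(
        io.BytesIO(payload), table_id, job_config=job_config
    ).result()

# First file: an unambiguous string, so auto-detection picks STRING.
load(b'{"field": "hello"}\n', autodetect=True)

# Second file: only null and quoted numbers. The append is rejected,
# because BigQuery re-infers the column as INTEGER for this file.
load(
    b'{"field": "1234"}\n{"field": null}\n',
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
```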
This behavior contradicts JSON.org itself:
"A value can be a STRING in DOUBLE QUOTES, or a number, or true, or false, or null, or an object, or an array. These structures can be nested."
> Make sure this file is processed first during the import.
That’s not a reliable solution, because with BigQuery Schema Auto Detection, you can’t control the processing order. Trying to work around this defeats the purpose of schema auto-detection.
BigQuery should never assign INTEGER when encountering quoted numbers; this is a bug.
ja...@google.com <ja...@google.com> #6
Hello,
Thank you for reaching out to us with your request.
We have duly noted your feedback and will thoroughly validate it. While we cannot provide an estimated time of implementation or guarantee that the request will be fulfilled, please be assured that your input is highly valued. Your feedback enables us to enhance our products and services.
We appreciate your continued trust and support in improving our Google Cloud Platform products. If you want to report a new issue, please do not hesitate to create a new one on the Issue Tracker.
Once again, we sincerely appreciate your valuable feedback. Thank you for your understanding and collaboration.
Description
I'm encountering a frustrating issue with BigQuery's Schema Auto Detection while importing data.
When importing a file containing only lines like: { "field": "123456" }
BigQuery interprets the value as an integer instead of a string, even though double quotes in JSON indicate a string.
When importing a large dataset via GCS, if the first processed file contains only null values and quoted numbers (e.g., "123456"), BigQuery may incorrectly infer the field type as INTEGER instead of STRING. As a result, future STRING values won't be writable because the field type is locked as INTEGER.
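Here is a sketch of the kind of GCS load that triggers this, using the google-cloud-bigquery Python client; the bucket, dataset, and table names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials/project
table_id = "my_project.my_dataset.gcs_import"  # placeholder table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)

# If the first file BigQuery happens to read has only null values and quoted
# numbers in "field", the column is inferred as INTEGER, and any later file
# containing real string values fails to load.
client.load_table_from_uri(
    "gs://my-bucket/export/*.json",  # placeholder wildcard URI
    table_id,
    job_config=job_config,
).result()

print(client.get_table(table_id).schema)  # shows INTEGER instead of STRING
```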
Thanks!