Comments
jk...@google.com <jk...@google.com>
ko...@gmail.com <ko...@gmail.com>
ch...@gmail.com <ch...@gmail.com> #2
Edit:
- Spurious line "65 / ~350" after the first paragraph
ko...@gmail.com <ko...@gmail.com> #3
+1
ch...@gmail.com <ch...@gmail.com> #4
Could you perhaps use a regex to parse the JSON string? Something like this should work (with some modifications for your use case):
WITH
yourTable AS (
SELECT
'{"bar": ["vimota", ""]}' AS json
UNION ALL
SELECT
'{"bar": [, "Brazil"]}' )
SELECT
ARRAY(
SELECT
REGEXP_EXTRACT(num, r'"(.*)"')
FROM
UNNEST(SPLIT(REGEXP_EXTRACT(JSON_EXTRACT(json,
'$.bar'), r'\[(.*)\]'))) AS num
WHERE
REGEXP_EXTRACT(num, r'"(.*)"') IS NOT NULL)
FROM
yourTable;
dr...@gmail.com <dr...@gmail.com> #5
Nope, many of our JSON arrays contain JSON string values with user-input characters like ", which would break a regex-based approach to parsing the JSON, since we'd have to distinguish " from \" from \\" from \\\", etc.
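To illustrate the point, here is a small sketch (in JavaScript, since that is what a UDF would run; `naiveSplit` is a hypothetical stand-in for the SQL REGEXP_EXTRACT/SPLIT approach above) showing how a regex-based splitter mishandles an escaped quote that a real JSON parser resolves correctly:

```javascript
// The JSON array below has a string element containing an escaped quote,
// exactly the kind of user input described above.
const input = '["a", "say \\"hi\\""]';

// Naive splitter mirroring the SQL REGEXP_EXTRACT/SPLIT approach:
// strip the brackets, split on commas, then unquote each piece with a regex.
function naiveSplit(jsonArray) {
  const inner = jsonArray.match(/\[(.*)\]/)[1];
  return inner.split(',').map(piece => {
    const m = piece.match(/"(.*)"/);
    return m ? m[1] : null;
  });
}

// A real JSON parser resolves the escape; the naive splitter leaves the
// backslashes in place (and would also break on commas inside strings).
const parsed = JSON.parse(input);  // second element is: say "hi"
const naive = naiveSplit(input);   // second element still contains backslashes
```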
da...@gmail.com <da...@gmail.com> #6
Thanks for the feedback! We'll take this suggestion into account as we plan JSON-related functionality, and I'll update here if and when there is more to share.
ch...@gmail.com <ch...@gmail.com> #7
Thanks! In the meantime, what's the best way to turn a JSON array into a BigQuery array? Looking through the docs on JSON functions, I don't see a way to achieve this other than writing a custom JavaScript UDF, which imposes the strict limitations of queries that use UDFs.
jo...@lumapps.com <jo...@lumapps.com> #8
The best option right now--if you need to take escaping into account--is using a JavaScript UDF. If you generally have a small number of JSON array elements and you want to handle escaped strings, you could use a hack like this one:
CREATE TEMP FUNCTION JsonExtractArray(json STRING) AS (
(SELECT ARRAY_AGG(v IGNORE NULLS)
FROM UNNEST([
JSON_EXTRACT_SCALAR(json, '$.foo[0]'),
JSON_EXTRACT_SCALAR(json, '$.foo[1]'),
JSON_EXTRACT_SCALAR(json, '$.foo[2]'),
JSON_EXTRACT_SCALAR(json, '$.foo[3]'),
JSON_EXTRACT_SCALAR(json, '$.foo[4]'),
JSON_EXTRACT_SCALAR(json, '$.foo[5]'),
JSON_EXTRACT_SCALAR(json, '$.foo[6]'),
JSON_EXTRACT_SCALAR(json, '$.foo[7]'),
JSON_EXTRACT_SCALAR(json, '$.foo[8]'),
JSON_EXTRACT_SCALAR(json, '$.foo[9]')]) AS v)
);
Even though there is an escaped quote inside the "bar" string in this example, you'll get the expected four elements:
SELECT JsonExtractArray('{"foo":[1,2,3,"ba\\"r"]}');
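For reference, a sketch of what the JavaScript UDF route might look like. The SQL wrapper shown in the comment is an assumption about how you would hook it up; the function body below is plain JavaScript:

```javascript
// Hypothetical JS body for a BigQuery JavaScript UDF; the assumed SQL wrapper
// (not executed here) would be something like:
//   CREATE TEMP FUNCTION JsonExtractArray(json STRING)
//   RETURNS ARRAY<STRING> LANGUAGE js AS """...this body...""";
function jsonExtractArray(json) {
  try {
    const arr = JSON.parse(json).foo;  // '.foo' matches the SQL hack above
    // Stringify each element, like JSON_EXTRACT_SCALAR does.
    return Array.isArray(arr) ? arr.map(String) : null;
  } catch (e) {
    return null;  // not valid JSON
  }
}
```

Unlike the hardcoded-index hack, this handles arrays of any length and escaped quotes, at the cost of the UDF limitations discussed above.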
ch...@gmail.com <ch...@gmail.com> #9
Yeah, hardcoding a max length on the input arrays is a non-starter for us.
to...@gmail.com <to...@gmail.com> #11
> Process the data differently (e.g. using Cloud Dataflow or another tool) so that you can load it from newline-delimited JSON into BigQuery.
We've been taking advantage of BigQuery to follow an ELT (extract-load-transform) pattern, where the T happens in BigQuery SQL itself, so adding another T step (i.e. ETLT) would be a heavy and undesirable change for us.
> Use a JavaScript UDF that takes the input JSON and returns the desired type; this is fairly straightforward but generally uses more CPU (and hence may require a higher billing tier).
(Discussed above.)
> Use SQL functions with the understanding that the solution breaks down if there are too many elements.
(Discussed above.)
oa...@gmail.com <oa...@gmail.com> #12
We have a similar issue with a map stored as JSON; parsing via regex is rather error-prone. Right now it seems a JavaScript UDF is the only option, and as mentioned before I'm fearing performance issues. In our case it's up to ~1M rows, with each row containing a map encoded as JSON (up to ~100 key-value pairs, possibly more later).
Should I open a separate ticket for json_extract_map: string -> map<string, string>?
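In the meantime, the map case can be approximated with the same UDF trick. A sketch, under the assumption that you represent the map as an array of key/value structs (BigQuery has no MAP type, and the wrapper in the comment is hypothetical):

```javascript
// Hypothetical JS body for a json_extract_map-style UDF; the assumed wrapper
// (not executed here) would be something like:
//   CREATE TEMP FUNCTION JsonExtractMap(json STRING)
//   RETURNS ARRAY<STRUCT<key STRING, value STRING>> LANGUAGE js AS """...""";
function jsonExtractMap(json) {
  try {
    const obj = JSON.parse(json);
    // Emit one {key, value} pair per property, values stringified.
    return Object.keys(obj).map(k => ({key: k, value: String(obj[k])}));
  } catch (e) {
    return null;  // not valid JSON
  }
}
```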
jo...@gmail.com <jo...@gmail.com> #13
Yes, please do (this is more along the lines of supporting a new type). Thanks!
se...@gmail.com <se...@gmail.com> #15
>> Process the data differently (e.g. using Cloud Dataflow or another tool) so that you can load it from newline-delimited JSON into BigQuery.
> We've been taking advantage of BigQuery to follow an ELT (extract-load-transform) pattern, where the T happens in BigQuery SQL itself, so adding another T step (i.e. ETLT) would be a heavy and undesirable change for us.
I think what the StackOverflow user, Elliott Brossard, was proposing is that instead of using an ELT pattern, you use an ETL pattern, with Dataproc/Dataflow as your transformation layer.
Basically:
1. Extract from source into Google Cloud Storage. (E)
2. Run a DataProc/DataFlow job to parse your data, and transform it as necessary. (T)
3. Write the result(s) to BigQuery. (L)
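The core of step 2 can be sketched as below (the "bar" field name is illustrative; the real job would run in Dataproc/Dataflow, not standalone): parse each source record once and emit newline-delimited JSON that BigQuery loads directly into a REPEATED column, so no string parsing is needed at query time.

```javascript
// Transform (T) step sketch: records in, NDJSON out.
function toNdjson(records) {
  return records
    .map(r => JSON.parse(r))                  // full JSON parse, escapes and all
    .map(o => JSON.stringify({bar: o.bar}))   // keep the array as a real array
    .join('\n');                              // one JSON object per line
}
```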
[Deleted User] <[Deleted User]> #16
Another option is to add a STRING_TO_ARRAY() function, since we already have the reverse one: ARRAY_TO_STRING().
It should basically do this:
regexp_extract_all(json_extract(FIELD, '$.keyWithArrayAsVal'), r'{[^}]+}')
al...@gmail.com <al...@gmail.com> #17
Any updates on this?
as...@gmail.com <as...@gmail.com> #18
Not yet. The best workarounds are those listed above, e.g. a JavaScript UDF or splitting with a regex, assuming the strings don't have escaped quotes in them.
da...@gmail.com <da...@gmail.com> #19
+1 for this -- it would be extraordinarily helpful for occasions where you use a third-party tool that integrates with BigQuery but where you can't control how the data arrives (e.g. as a string).
A prime example of such a tool is Segment.com:
- can be configured to use Redshift or BigQuery as a data warehouse
- stringifies arrays before sending to warehouse
ab...@gmail.com <ab...@gmail.com> #20
+1 for this.
br...@gmail.com <br...@gmail.com> #21
+100 for this.
Please implement a solution for this.
dy...@gmail.com <dy...@gmail.com> #22
+1 for this.
Description
Stack Overflow thread: