Fixed
Status Update
Comments
bl...@google.com <bl...@google.com>
jd...@gmail.com <jd...@gmail.com> #2
Edit:
- Spurious line "65 / ~350" after the first paragraph
b....@gmail.com <b....@gmail.com> #3
+1
vi...@google.com <vi...@google.com> #4
Could you perhaps use REGEX to parse the json string? Something like this should work (with some modifications for your use case):
WITH yourTable AS (
  SELECT '{"bar": ["vimota", ""]}' AS json
  UNION ALL
  SELECT '{"bar": [, "Brazil"]}'
)
SELECT
  ARRAY(
    SELECT REGEXP_EXTRACT(num, r'"(.*)"')
    FROM UNNEST(SPLIT(REGEXP_EXTRACT(JSON_EXTRACT(json, '$.bar'), r'\[(.*)\]'))) AS num
    WHERE REGEXP_EXTRACT(num, r'"(.*)"') IS NOT NULL
  )
FROM yourTable;
jd...@gmail.com <jd...@gmail.com> #5
Nope, many of our json arrays contain json string values with user-input chars like " which would break a regex-based approach to parsing the json, since we'd have to distinguish " from \" from \\" from \\\", etc.
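For example (a minimal sketch with made-up values), the SPLIT/REGEXP approach from #4 falls apart as soon as an element contains a comma or an escaped quote:
```
WITH yourTable AS (
  -- The first element contains both a comma and escaped quotes.
  SELECT '{"bar": ["say \\"hi, there\\"", "ok"]}' AS json
)
SELECT
  SPLIT(REGEXP_EXTRACT(JSON_EXTRACT(json, '$.bar'), r'\[(.*)\]')) AS pieces
FROM yourTable;
-- Yields three fragments instead of two elements, because the comma inside
-- the string is treated as a delimiter, and the backslash escapes survive.
```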
vi...@google.com <vi...@google.com>
el...@google.com <el...@google.com> #6
Thanks for the feedback! We'll take this suggestion into account as we plan JSON-related functionality, and I'll update here if and when there is more to share.
jd...@gmail.com <jd...@gmail.com> #7
Thanks! In the meantime, what's the best way to turn a json array into a bq array? Looking through the docs on json functions I don't see a way to achieve this, other than writing a custom javascript udf, which imposes the strict limitations of queries that use udfs.
el...@google.com <el...@google.com> #8
The best option right now--if you need to take escaping into account--is using a JavaScript UDF. If you generally have a small number of JSON array elements and you want to handle escaped strings, you could use a hack like this one:
CREATE TEMP FUNCTION JsonExtractArray(json STRING) AS (
(SELECT ARRAY_AGG(v IGNORE NULLS)
FROM UNNEST([
JSON_EXTRACT_SCALAR(json, '$.foo[0]'),
JSON_EXTRACT_SCALAR(json, '$.foo[1]'),
JSON_EXTRACT_SCALAR(json, '$.foo[2]'),
JSON_EXTRACT_SCALAR(json, '$.foo[3]'),
JSON_EXTRACT_SCALAR(json, '$.foo[4]'),
JSON_EXTRACT_SCALAR(json, '$.foo[5]'),
JSON_EXTRACT_SCALAR(json, '$.foo[6]'),
JSON_EXTRACT_SCALAR(json, '$.foo[7]'),
JSON_EXTRACT_SCALAR(json, '$.foo[8]'),
JSON_EXTRACT_SCALAR(json, '$.foo[9]')]) AS v)
);
Even though there is an escaped quote inside the "bar" string in this example, you'll get the expected four elements:
SELECT JsonExtractArray('{"foo":[1,2,3,"ba\\"r"]}');
jd...@gmail.com <jd...@gmail.com> #9
Yeah, hardcoding a max length on the input arrays is a non-starter for us.
jd...@gmail.com <jd...@gmail.com> #11
> Process the data differently (e.g. using Cloud Dataflow or another tool) so that you can load it from newline-delimited JSON into BigQuery.
We've been taking advantage of BigQuery to follow an ELT (extract-load-transform) pattern, where the T happens in BigQuery SQL itself, so adding another T step (making it ETLT) would be a heavy and undesirable change for us.
> Use a JavaScript UDF that takes the input JSON and returns the desired type; this is fairly straightforward but generally uses more CPU (and hence may require a higher billing tier).
(Discussed above.)
> Use SQL functions with the understanding that the solution breaks down if there are too many elements.
(Discussed above.)
[Deleted User] <[Deleted User]> #12
We have a similar issue with maps stored in JSON; parsing via regex is rather error-prone. Right now it seems a JavaScript UDF is the only option, and as mentioned before I'm fearing performance issues. In our case it's up to ~1M rows, with each row containing a map encoded as JSON (up to ~100 key-value pairs, possibly more later).
Should I open a separate ticket for json_extract_map: string -> map<string, string> ?
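In the meantime, one possible workaround (a rough sketch; the function name and field names are just placeholders, not an official API) is a JavaScript UDF that returns the map as an array of key/value structs, since BigQuery has no native map type:
```
-- Sketch only: parse a JSON object into an array of key/value pairs.
-- Non-string values are stringified.
CREATE TEMP FUNCTION json_object_to_pairs(s STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>>
LANGUAGE js AS """
  if (s == null) return null;
  var obj = JSON.parse(s);
  return Object.keys(obj).map(function(k) {
    var v = obj[k];
    return {key: k, value: v == null ? null : String(v)};
  });
""";

SELECT kv.key, kv.value
FROM UNNEST(json_object_to_pairs('{"a": 1, "b": "two"}')) AS kv;
```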
el...@gmail.com <el...@gmail.com> #13
Yes, please do (this is more along the lines of supporting a new type). Thanks!
[Deleted User] <[Deleted User]> #15
>> Process the data differently (e.g. using Cloud Dataflow or another tool) so that you can load it from newline-delimited JSON into BigQuery.
> We've been taking advantage of BigQuery to follow an ELT (extract-load-transform) pattern, where the T happens in BigQuery SQL itself, so adding another T step (making it ETLT) would be a heavy and undesirable change for us.
I think what the StackOverflow user, Elliott Brossard, was proposing is that instead of using an ELT pattern, use an ETL pattern, with DataProc/DataFlow as your transformation technology/layer.
Basically:
1. Extract from source into Google Cloud Storage. (E)
2. Run a DataProc/DataFlow job to parse your data, and transform it as necessary. (T)
3. Write the result(s) to BigQuery. (L)
[Deleted User] <[Deleted User]> #16
Another option is to add a STRING_TO_ARRAY() function, as we already have the reverse one: ARRAY_TO_STRING().
It should basically do this:
regexp_extract_all(json_extract(FIELD, '$.keyWithArrayAsVal'), '{[^}]+}')
[Deleted User] <[Deleted User]> #17
Any updates on this?
el...@gmail.com <el...@gmail.com> #18
Not yet. The best workarounds are those listed above, e.g. a JavaScript UDF or splitting with a regex, assuming the strings don't have escaped quotes in them.
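For reference, a condensed sketch of the regex-splitting workaround for that simple case (placeholder table/column names; it assumes no escaped quotes or embedded commas in the elements):
```
SELECT
  ARRAY(
    SELECT TRIM(part, ' "')
    FROM UNNEST(SPLIT(REGEXP_EXTRACT(JSON_EXTRACT(json, '$.bar'), r'\[(.*)\]'))) AS part
    WHERE TRIM(part) != ''
  ) AS bar_elements
FROM yourTable;
```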
od...@actionforresults.com <od...@actionforresults.com> #19
+1 for this -- would be extraordinarily helpful for occasions where you may use a 3rd-party tool that integrates with BigQuery but where you can't control how the data arrives (e.g. as a string).
A prime example of such a tool is Segment.com:
- can be configured to use Redshift or BigQuery as a data warehouse
- stringifies arrays before sending to warehouse
bc...@gmail.com <bc...@gmail.com> #20
+1 for this.
ro...@gmail.com <ro...@gmail.com> #21
+100 for this.
Please implement a solution for this.
an...@gmail.com <an...@gmail.com> #22
+1 for this.
[Deleted User] <[Deleted User]> #23
+1 for this
pa...@juul.com <pa...@juul.com> #24
+1 for this
[Deleted User] <[Deleted User]> #25
+1 for this
mi...@gmail.com <mi...@gmail.com> #26
+1
an...@phenixrts.com <an...@phenixrts.com> #27
+1
el...@google.com <el...@google.com>
ja...@gmail.com <ja...@gmail.com> #28
+1
It seems insane that in late 2019, BigQuery can't unnest a stringified JSON array without resorting to performance-breaking hacks. What the heck Google?
th...@google.com <th...@google.com>
[Deleted User] <[Deleted User]> #29
The inability to unnest JSON arrays is blocking us from migrating our application completely to the cloud.
[Deleted User] <[Deleted User]> #30
+1
ga...@octane11.com <ga...@octane11.com> #31
+1
yv...@extrahop.com <yv...@extrahop.com> #32
+1
ro...@outfit7.com <ro...@outfit7.com> #33
+1
I ran into this issue again today. I was able to work around it using REGEXP_EXTRACT, but now I have to teach this hack (and its associated pitfalls and limitations) to the whole team.
[Deleted User] <[Deleted User]> #34
+1
nf...@gmail.com <nf...@gmail.com> #35
+1
What I wouldn't give for a JSON_EXTRACT_ARRAY function. JSON_EXTRACT already allows array access by index, and JSON_EXTRACT_SCALAR will actually return NULL if the result is an array (or an object), so it seems safe to assume there are already means within those functions to parse JSON arrays - can we not expose arrays natively?
In addition, pretty much every ETL platform converts JSON columns into strings within BigQuery, but if the data in those columns cannot be readily converted into an array, BigQuery becomes a real handicap in the ETL process. I appreciate any consideration here.
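For illustration, the kind of call being asked for (hypothetical; no such built-in exists at the time of writing):
```
-- Hypothetical usage of a native JSON_EXTRACT_ARRAY(json_string, json_path).
SELECT element
FROM UNNEST(JSON_EXTRACT_ARRAY('{"foo": [1, 2, 3, "ba\\"r"]}', '$.foo')) AS element;
```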
xi...@gmail.com <xi...@gmail.com> #36
+1
rc...@indodanafinance.com <rc...@indodanafinance.com> #37
+1
gl...@adaptavist.com <gl...@adaptavist.com> #38
+1
sh...@gmail.com <sh...@gmail.com> #39
+1
rb...@hioscar.com <rb...@hioscar.com> #40
+1
[Deleted User] <[Deleted User]> #41
+1 this would be really helpful.
sa...@karagonen.com <sa...@karagonen.com> #42
+1
[Deleted User] <[Deleted User]> #43
+1
[Deleted User] <[Deleted User]> #44
+1
ee...@migros.com.tr <ee...@migros.com.tr> #45
+1
[Deleted User] <[Deleted User]> #46
+1 !
zd...@gmail.com <zd...@gmail.com> #47
I just tried to create a UDF called json_extract_array, but it reported an error: "User-defined function name 'json_extract_array' conflicts with a reserved built-in function name". Unfortunately, it's not usable yet :) but it looks like there is some progress.
[Deleted User] <[Deleted User]> #48
+1
[Deleted User] <[Deleted User]> #49
+1
[Deleted User] <[Deleted User]> #50
+1
sa...@karagonen.com <sa...@karagonen.com> #51
I think the function shouldn't return only strings. It could work like the cast functions. So I'd like to see a syntax like
json_extract_array(json_array as data_type)
So if the elements are INT64, I should be able to use it like
json_extract_array(json_array as int64)
And of course, the default data_type parameter can be STRING.
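Until something like that exists, a similar effect can be had by casting the elements of a string-array result; a sketch, assuming some UDF (here called extract_string_array, a placeholder) that returns ARRAY<STRING>:
```
-- Sketch: cast each extracted string element to INT64.
SELECT
  ARRAY(
    SELECT CAST(elem AS INT64)
    FROM UNNEST(extract_string_array(json, 'bar')) AS elem
  ) AS bar_ints
FROM yourTable;
```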
[Deleted User] <[Deleted User]> #52
+1
[Deleted User] <[Deleted User]> #53
+1
[Deleted User] <[Deleted User]> #54
+1
am...@gmail.com <am...@gmail.com> #55
+1
vh...@gmail.com <vh...@gmail.com> #56
+1
vh...@gmail.com <vh...@gmail.com> #57
ja...@google.com <ja...@google.com> #59
Thanks for all the upvotes.
Description
```
create temp function json_extract_array(s string, key string) returns array<string> language js as """
  try {
    var xs = JSON.parse(s)[key];
    return xs == null ? null : xs.filter((x, i) => x != null).map((x, i) => x.toString());
  } catch (e) {
    throw e + ', on input string s: ' + s;
  }
""";
```
In particular, we're starting to see persistent query timeouts on a table with:
- ~5M rows
- ~350 columns
- ~65 columns that are strings representing JSON arrays, all of which we parse with our `json_extract_array` udf
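A sketch of the kind of per-column call involved (table and column names are placeholders):
```
-- Placeholder names; one UDF call per JSON column, ~65 in total.
SELECT
  json_extract_array(col_a, 'items') AS col_a_items,
  json_extract_array(col_b, 'items') AS col_b_items
  -- ...and so on for the remaining JSON columns
FROM `project.dataset.wide_table`;
```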
Questions:
1. Is there a better way to achieve this than using a udf?
2. Is this already on the team's internal roadmap?