Fixed
Status Update
Comments
bl...@google.com <bl...@google.com>
jd...@gmail.com <jd...@gmail.com> #2
Edit:
- Spurious line "65 / ~350" after the first paragraph
b....@gmail.com <b....@gmail.com> #3
+1
vi...@google.com <vi...@google.com> #4
Could you perhaps use REGEX to parse the json string? Something like this should work (with some modifications for your use case):
WITH yourTable AS (
  SELECT '{"bar": ["vimota", ""]}' AS json
  UNION ALL
  SELECT '{"bar": [, "Brazil"]}'
)
SELECT
  ARRAY(
    SELECT REGEXP_EXTRACT(num, r'"(.*)"')
    FROM UNNEST(SPLIT(REGEXP_EXTRACT(JSON_EXTRACT(json, '$.bar'), r'\[(.*)\]'))) AS num
    WHERE REGEXP_EXTRACT(num, r'"(.*)"') IS NOT NULL
  )
FROM yourTable;
jd...@gmail.com <jd...@gmail.com> #5
Nope, many of our json arrays contain json string values with user-input chars like " which would break a regex-based approach to parsing the json, since we'd have to distinguish " from \" from \\" from \\\", etc.
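For example (a minimal sketch with made-up values), the SPLIT/REGEXP approach from #4 falls apart as soon as an element contains a comma or an escaped quote:
```
WITH yourTable AS (
  -- The first element contains both a comma and escaped quotes.
  SELECT '{"bar": ["say \\"hi, there\\"", "ok"]}' AS json
)
SELECT
  SPLIT(REGEXP_EXTRACT(JSON_EXTRACT(json, '$.bar'), r'\[(.*)\]')) AS pieces
FROM yourTable;
-- Yields three fragments instead of two elements, because the comma inside
-- the string is treated as a delimiter, and the backslash escapes survive.
```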
vi...@google.com <vi...@google.com>
el...@google.com <el...@google.com> #6
Thanks for the feedback! We'll take this suggestion into account as we plan JSON-related functionality, and I'll update here if and when there is more to share.
jd...@gmail.com <jd...@gmail.com> #7
Thanks! In the meantime, what's the best way to turn a json array into a bq array? Looking through the docs on json functions I don't see a way to achieve this, other than writing a custom javascript udf, which imposes the strict limitations of queries that use udfs.
el...@google.com <el...@google.com> #8
The best option right now--if you need to take escaping into account--is using a JavaScript UDF. If you generally have a small number of JSON array elements and you want to handle escaped strings, you could use a hack like this one:
CREATE TEMP FUNCTION JsonExtractArray(json STRING) AS (
(SELECT ARRAY_AGG(v IGNORE NULLS)
FROM UNNEST([
JSON_EXTRACT_SCALAR(json, '$.foo[0]'),
JSON_EXTRACT_SCALAR(json, '$.foo[1]'),
JSON_EXTRACT_SCALAR(json, '$.foo[2]'),
JSON_EXTRACT_SCALAR(json, '$.foo[3]'),
JSON_EXTRACT_SCALAR(json, '$.foo[4]'),
JSON_EXTRACT_SCALAR(json, '$.foo[5]'),
JSON_EXTRACT_SCALAR(json, '$.foo[6]'),
JSON_EXTRACT_SCALAR(json, '$.foo[7]'),
JSON_EXTRACT_SCALAR(json, '$.foo[8]'),
JSON_EXTRACT_SCALAR(json, '$.foo[9]')]) AS v)
);
Even though there is an escaped quote inside the "bar" string in this example, you'll get the expected four elements:
SELECT JsonExtractArray('{"foo":[1,2,3,"ba\\"r"]}');
jd...@gmail.com <jd...@gmail.com> #9
Yeah, hardcoding a max length on the input arrays is a non-starter for us.
jd...@gmail.com <jd...@gmail.com> #11
> Process the data differently (e.g. using Cloud Dataflow or another tool) so that you can load it from newline-delimited JSON into BigQuery.
We've been taking advantage of BigQuery to follow an ELT (extract-load-transform) pattern, where the T happens in BigQuery SQL itself, so adding another T step (making it ETLT) would be a heavy and undesirable change for us.
> Use a JavaScript UDF that takes the input JSON and returns the desired type; this is fairly straightforward but generally uses more CPU (and hence may require a higher billing tier).
(Discussed above.)
> Use SQL functions with the understanding that the solution breaks down if there are too many elements.
(Discussed above.)
[Deleted User] <[Deleted User]> #12
We have a similar issue with maps stored in JSON; parsing via regex is rather error-prone. Right now it seems a JavaScript UDF is the only option, and as mentioned before I'm fearing performance issues. In our case it's up to ~1M rows, with each row containing a map encoded as JSON (up to ~100 key-value pairs, possibly more later).
Should I open a separate ticket for json_extract_map: string -> map<string, string> ?
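In the meantime, one possible workaround (a rough sketch; the function name and field names are just placeholders, not an official API) is a JavaScript UDF that returns the map as an array of key/value structs, since BigQuery has no native map type:
```
-- Sketch only: parse a JSON object into an array of key/value pairs.
-- Non-string values are stringified.
CREATE TEMP FUNCTION json_object_to_pairs(s STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>>
LANGUAGE js AS """
  if (s == null) return null;
  var obj = JSON.parse(s);
  return Object.keys(obj).map(function(k) {
    var v = obj[k];
    return {key: k, value: v == null ? null : String(v)};
  });
""";

SELECT kv.key, kv.value
FROM UNNEST(json_object_to_pairs('{"a": 1, "b": "two"}')) AS kv;
```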
el...@gmail.com <el...@gmail.com> #13
Yes, please do (this is more along the lines of supporting a new type). Thanks!
[Deleted User] <[Deleted User]> #15
>> Process the data differently (e.g. using Cloud Dataflow or another tool) so that you can load it from newline-delimited JSON into BigQuery.
> We've been taking advantage of BigQuery to follow an ELT (extract-load-transform) pattern, where the T happens in BigQuery SQL itself, so adding another T step (making it ETLT) would be a heavy and undesirable change for us.
I think what the StackOverflow user, Elliott Brossard, was proposing is that instead of using an ELT pattern, use an ETL pattern, with DataProc/DataFlow as your transformation technology/layer.
Basically:
1. Extract from source into Google Cloud Storage. (E)
2. Run a DataProc/DataFlow job to parse your data, and transform it as necessary. (T)
3. Write the result(s) to BigQuery. (L)
[Deleted User] <[Deleted User]> #16
Another option is to add a STRING_TO_ARRAY() function, as we already have the reverse one: ARRAY_TO_STRING().
It should basically do this:
regexp_extract_all(json_extract(FIELD, '$.keyWithArrayAsVal'), '{[^}]+}')
[Deleted User] <[Deleted User]> #17
Any updates on this?
el...@gmail.com <el...@gmail.com> #18
Not yet. The best workarounds are those listed above, e.g. a JavaScript UDF or splitting with a regex, assuming the strings don't have escaped quotes in them.
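For reference, a condensed sketch of the regex-splitting workaround for that simple case (placeholder table/column names; it assumes no escaped quotes or embedded commas in the elements):
```
SELECT
  ARRAY(
    SELECT TRIM(part, ' "')
    FROM UNNEST(SPLIT(REGEXP_EXTRACT(JSON_EXTRACT(json, '$.bar'), r'\[(.*)\]'))) AS part
    WHERE TRIM(part) != ''
  ) AS bar_elements
FROM yourTable;
```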
od...@actionforresults.com <od...@actionforresults.com> #19
+1 for this -- would be extraordinarily helpful for occasions where you may use a 3rd-party tool that integrates with BigQuery but where you can't control how the data arrives (e.g. as a string).
A prime example of such a tool is Segment.com:
- can be configured to use Redshift or BigQuery as a data warehouse
- stringifies arrays before sending to warehouse
bc...@gmail.com <bc...@gmail.com> #20
+1 for this.
ro...@gmail.com <ro...@gmail.com> #21
+100 for this.
Please implement a solution for this.
an...@gmail.com <an...@gmail.com> #22
+1 for this.
[Deleted User] <[Deleted User]> #23
+1 for this
pa...@juul.com <pa...@juul.com> #24
+1 for this
[Deleted User] <[Deleted User]> #25
+1 for this
mi...@gmail.com <mi...@gmail.com> #26
+1
an...@phenixrts.com <an...@phenixrts.com> #27
+1
el...@google.com <el...@google.com>
ja...@gmail.com <ja...@gmail.com> #28
+1
It seems insane that in late 2019, BigQuery can't unnest a stringified JSON array without resorting to performance-breaking hacks. What the heck Google?
th...@google.com <th...@google.com>
[Deleted User] <[Deleted User]> #29
The inability to unnest JSON arrays is blocking us from migrating our application completely to the cloud.
[Deleted User] <[Deleted User]> #30
+1
ga...@octane11.com <ga...@octane11.com> #31
+1
yv...@extrahop.com <yv...@extrahop.com> #32
+1
ro...@outfit7.com <ro...@outfit7.com> #33
+1
I ran into this issue again today. I was able to work around it using REGEXP_EXTRACT, but now I have to teach this hack (and its associated pitfalls and limitations) to the whole team.
[Deleted User] <[Deleted User]> #34
+1
nf...@gmail.com <nf...@gmail.com> #35
+1
What I wouldn't give for a JSON_EXTRACT_ARRAY function. JSON_EXTRACT already allows array access by index, and JSON_EXTRACT_SCALAR will actually return NULL if the result is an array (or an object), so it seems safe to assume there are already means within those functions to parse JSON arrays - can we not expose arrays natively?
In addition, pretty much every ETL platform converts JSON columns into strings within BigQuery, but if the data in those columns cannot be readily converted into an array, BigQuery becomes a real handicap in the ETL process. I appreciate any consideration here.
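For illustration, the kind of call being asked for (hypothetical; no such built-in exists at the time of writing):
```
-- Hypothetical usage of a native JSON_EXTRACT_ARRAY(json_string, json_path).
SELECT element
FROM UNNEST(JSON_EXTRACT_ARRAY('{"foo": [1, 2, 3, "ba\\"r"]}', '$.foo')) AS element;
```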
xi...@gmail.com <xi...@gmail.com> #36
+1
rc...@indodanafinance.com <rc...@indodanafinance.com> #37
+1
gl...@adaptavist.com <gl...@adaptavist.com> #38
+1
sh...@gmail.com <sh...@gmail.com> #39
+1
rb...@hioscar.com <rb...@hioscar.com> #40
+1
[Deleted User] <[Deleted User]> #41
+1 this would be really helpful.
sa...@karagonen.com <sa...@karagonen.com> #42
+1
[Deleted User] <[Deleted User]> #43
+1
[Deleted User] <[Deleted User]> #44
+1
ee...@migros.com.tr <ee...@migros.com.tr> #45
+1
[Deleted User] <[Deleted User]> #46
+1 !
zd...@gmail.com <zd...@gmail.com> #47
I just tried to create a UDF called json_extract_array, but it reported an error: "User-defined function name 'json_extract_array' conflicts with a reserved built-in function name". Unfortunately, it's not usable yet :) but it looks like there is some progress.
[Deleted User] <[Deleted User]> #48
+1
[Deleted User] <[Deleted User]> #49
+1
[Deleted User] <[Deleted User]> #50
+1
sa...@karagonen.com <sa...@karagonen.com> #51
I think the function shouldn't return only strings. It could work like the cast functions. So I'd like to see a syntax like
json_extract_array(json_array as data_type)
So if the elements are INT64, I should be able to use it like
json_extract_array(json_array as int64)
And of course, the default data_type parameter can be STRING.
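Until something like that exists, a similar effect can be had by casting the elements of a string-array result; a sketch, assuming some UDF (here called extract_string_array, a placeholder) that returns ARRAY<STRING>:
```
-- Sketch: cast each extracted string element to INT64.
SELECT
  ARRAY(
    SELECT CAST(elem AS INT64)
    FROM UNNEST(extract_string_array(json, 'bar')) AS elem
  ) AS bar_ints
FROM yourTable;
```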
[Deleted User] <[Deleted User]> #52
+1
[Deleted User] <[Deleted User]> #53
+1
[Deleted User] <[Deleted User]> #54
+1
am...@gmail.com <am...@gmail.com> #55
+1
vh...@gmail.com <vh...@gmail.com> #56
+1
vh...@gmail.com <vh...@gmail.com> #57
ja...@google.com <ja...@google.com> #59
Thanks for all the upvotes.
Description
```
create temp function json_extract_array(s string, key string) returns array<string> language js as """
  try {
    var xs = JSON.parse(s)[key];
    return xs == null ? null : xs.filter((x, i) => x != null).map((x, i) => x.toString());
  } catch (e) {
    throw e + ', on input string s: ' + s;
  }
""";
```
In particular, we're starting to see persistent query timeouts on a table with:
- ~5M rows
- ~350 columns
- ~65 columns that are strings representing JSON arrays, all of which we parse with our `json_extract_array` udf
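A sketch of the kind of per-column call involved (table and column names are placeholders):
```
-- Placeholder names; one UDF call per JSON column, ~65 in total.
SELECT
  json_extract_array(col_a, 'items') AS col_a_items,
  json_extract_array(col_b, 'items') AS col_b_items
  -- ...and so on for the remaining JSON columns
FROM `project.dataset.wide_table`;
```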
Questions:
1. Is there a better way to achieve this than using a udf?
2. Is this already on the team's internal roadmap?