Partitioning/Clustering on non-date field [35905817]

Assigned

Feature Request

Status Update

No update yet.

Description

de...@derekperkins.com

created issue #1

Sep 24, 2016 08:52PM

On most of my datasets, I have 2 query patterns, which I think are broadly applicable.
1) Date - good for broad recent queries
2) ID - good for narrow historical queries

I'm happy to pay to store my data twice, and in fact I am currently doing that in 10M tables. I would love to be able to benefit from the partitioning improvements rather than maintain them all manually.

Is key based partitioning coming up soon on the roadmap?

Thanks,
Derek

Comments

ep...@google.com <ep...@google.com> #2Sep 25, 2016 02:31AM

Assigned to ep...@google.com.

This error shows up when there is a module in your project whose .iml file does not contain:

external.system.id="GRADLE"

Can you please check your .iml files? Also, instead of opening the project, *import* it, that will completely rewrite your .iml files and you won't see that error again.

de...@derekperkins.com <de...@derekperkins.com> #3Sep 25, 2016 07:40AM

I can confirm that it works with AS 0.8.14 if I do

1) open AS,
2) delete the Gradle Java modules from the project,
3) re-import the Gradle Java modules to the project,
4) close AS,
5) re-open AS.

In that case AS does not *not* complain about Gradle Java modules to be non-Gradle Java modules, and I've confirmed that the generated *.iml files contain

external.system.id="GRADLE".

However, if I do

6) close AS,
7) delete the *.iml files,
8) re-open AS,

then AS again complains *although* the generates files again contain

external.system.id="GRADLE". It seems that the problem is related to "open" vs. "import". In the latter instructions, the *.iml files seem to get implicitly generated because the Gradle Java modules are referred to from setting.gradle.

Can it be that AS is somehow performing the check for non-Gradle Java modules before the *.iml files are generated in case of just opening (instead of importing) the project?

de...@derekperkins.com <de...@derekperkins.com> #4Sep 25, 2016 07:42AM

I see what is happening now. Yes, you are right. This is an issue with "open". I think I know how to fix this issue.

ke...@king.com <ke...@king.com> #5Mar 19, 2017 12:34AM

Thanks for the confirmation. On a more or less related note, as this would not have happened if I was committing the *.iml files, what is your recommendation WRT this? I know about [1] which recommends to share "All the .iml module files", but it's kind of annoying that these files get rewritten if e.g. dependencies change in build.gradle. In general, most of the *.iml files seem to be duplicate content WRT the build.gradle files and can be derived from them. So, should we commit *.iml files for Gradle Android modules to Git?

[1]

https://intellij-support.jetbrains.com/entries/23393067

ga...@gmail.com <ga...@gmail.com> #6Mar 21, 2017 12:02AM

It is better not to add iml files to source control. For regular IDEA projects is OK because .iml files are the source of truth for project configuration. In the case of Android projects, the source of truth is gradle.build files, and .iml are generated from them every time you sync your project with Gradle.

r....@gmail.com <r....@gmail.com> #7Apr 25, 2017 08:53AM

Thanks for the clear recommendation!

r....@gmail.com <r....@gmail.com> #8Apr 25, 2017 08:56AM

@5 As you seem to know what the problem is, would it be appropriate to change the status from "NeedsInfo" to "Accepted"?

jm...@gmail.com <jm...@gmail.com> #9Jun 29, 2017 06:44AM

This issue is fixed in Studio 1.2.

ep...@google.com <ep...@google.com> #10Jun 29, 2017 06:48AM

It also happens if there are comments in .iml file like:

r....@gmail.com <r....@gmail.com> #11Jun 29, 2017 09:57AM

I just received this error from AS 1.2.1.1 and I do not have any non-Gradle Java modules in my project.

I have one main android application + six Android libraries.

We do not commit our .iml files to source control.

ke...@king.com <ke...@king.com> #12Jun 29, 2017 11:33AM

I received the same error in version 1.3.2.

ku...@xiatech.co.uk <ku...@xiatech.co.uk> #13Jul 16, 2017 09:47AM

I received this error on 2.1.2.

sw...@bainbridgehealth.com <sw...@bainbridgehealth.com> #14Jul 16, 2017 02:40PM

[Comment deleted]

bz...@gmail.com <bz...@gmail.com> #15Jul 16, 2017 03:27PM

received this error in 2.3 canary 2

pe...@gmail.com <pe...@gmail.com> #16Jul 16, 2017 03:30PM

I am receiving this error in AS 2.3 Canary 2 after doing File->Invalidate Caches->Restart. Mine is a CMake project. This error is not very useful, since the project eventually builds anyway.

bz...@gmail.com <bz...@gmail.com> #17Jul 16, 2017 03:49PM

[Comment deleted]

pe...@gmail.com <pe...@gmail.com> #18Jul 16, 2017 04:06PM

I am seeing this with AS 2.3 Beta 1 as well. Since this issue has been closed for some time, I went ahead and created a new issue @

https://code.google.com/p/android/issues/detail?id=230550&thanks=230550&ts=1482021669

ep...@google.com <ep...@google.com> #19Jul 16, 2017 06:52PM

#18 & #19: Did you happen to get this error after switching to Gradle 3.2?

And are you, by any chance, using the com.github.dcendents.android-maven plugin? I upgraded from 1.4.2 of that to 1.5, and updated to Gradle 3.2 at the same time, and I'm seeing the error even after downgrading to AS 2.2.3.

I'm in the process of doing a clean install now, and then I'll be trying reverting back to the older version of both, to see if that changes anything.

de...@derekperkins.com <de...@derekperkins.com> #20Jul 17, 2017 04:30AM

I don't know what's changed, but after a clean install, and removing ALL .iml files, after an initial complaint on opening the project, it rebuild all the .iml files and now seems fine. It's weird that last night, I tried making a brand-new project, and got the same results, but I'll take what I can get.

That said, it seems interesting to me that several people at once all seemed to run into the same thing. There may be something going on, still.

bi...@ilab.dk <bi...@ilab.dk> #21Aug 22, 2017 05:28PM

The issue is still exist in android studio 3.0.1

de...@derekperkins.com <de...@derekperkins.com> #22Aug 22, 2017 06:38PM

Same here on android studio 3.0.1

bz...@gmail.com <bz...@gmail.com> #23Aug 22, 2017 07:05PM

Same here on android studio 3.0.1

ma...@google.com <ma...@google.com> #24Oct 17, 2017 12:06AM

Same here on android studio 3.0.1

ku...@xiatech.co.uk <ku...@xiatech.co.uk> #25Oct 17, 2017 08:29AM

The same in Android 3.1

la...@amedia.no <la...@amedia.no> #26Oct 21, 2017 09:19AM

also the same in Android 3.1

da...@gmail.com <da...@gmail.com> #27Nov 21, 2017 06:02PM

where can i find java gradle module?

jo...@gmail.com <jo...@gmail.com> #28Dec 17, 2017 12:39PM

same error in Android Studio 3.1.2

ku...@xiatech.co.uk <ku...@xiatech.co.uk> #29Dec 17, 2017 12:49PM

Msg Error in Android Studio v 3.1.2 :
6:48 PM Unsupported Modules Detected: Compilation is not supported for following modules: AndroidStudioProjects. Unfortunately you can't have non-Gradle Java modules and Android-Gradle modules in one project.
I do not have any non-Gradle Java modules in my project.

ku...@xiatech.co.uk <ku...@xiatech.co.uk> #30Dec 21, 2017 05:58PM

The error occurs also in:

Android Studio 3.1.4
Build #AI-173.4907809, built on July 23, 2018
JRE: 1.8.0_152-release-1024-b01 amd64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
Linux 4.15.0-32-generic

I am not sure, but I believe this causes further errors which are quite problematic. I have several projects with this issue: in all of them, some dependencies that are defined in a module's build.gradle file cannot be resolved by AS. Actually, the dependencies are resolved and downloaded from the repositories and show up under "External Libraries" in Project view. However, everywhere I try to use any class from those dependencies... I get a "Cannot resolve symbol XYZ". So I end up stuck with unresolved imports, no way to navigate through classes, etc.

The wierdest thing, though, is that the app compiles fine and runs in a device. It's just AS cannot resolve symbols from well-resolved dependencies.

Project structure overview:
- "app" Android application module (Gradle); depends on "commons", "annotations" and "processor" modules -> *gets detected as non-Gradle*
- "commons" Android library module (Gradle) with common utils, views, dependencies, etc.
- "annotations" java module (Gradle)
- "processor" java module (Gradle)

The symbols that are not resolved by AS come from dependencies defined in the "commons" module.

I have invalidated caches, deleted .idea and .gradle folders, re-downloaded the project, deleted ~/.gradle/caches folder, deleted *.iml files... no luck whatsoever.

Any hints on this issue? Do you believe the "Unsupported Modules Detected" error has anything to do with the "Cannot resolve symbol" issue?

Thanks a lot and best regards.

[Deleted User] <[Deleted User]> #31Dec 21, 2017 08:19PM

Same error in 3.2.1

Did some research and this was the outcome after trying in Windows & Mac.

When importing a project you have a screen giving 2 options:
- Create project from existing sources
- Import project from external model

Inside the "Import project from external model" there are 2 more options:

- Android Gradle
- Gradle

If you select "Android Gradle" everything is fine, no false positive at all.
If you select Gradle you will get the false positive error message every time you open that project in Android Studio.

ku...@xiatech.co.uk <ku...@xiatech.co.uk> #32Jan 3, 2018 11:07AM

same error in 3.5.1

ep...@google.com <ep...@google.com> #33Jan 4, 2018 06:17AM

Same error in 4.0.0

Message last modified on Jan 4, 2018 06:18AM

ya...@gmail.com <ya...@gmail.com> #34Jan 9, 2018 09:23AM

Same issue I am facing every single time I open my project in android studio 4.0.0

ep...@google.com <ep...@google.com> #35Jan 9, 2018 05:05PM

AS 4,0,1
for one of older project
deleted .idea and .gradle folders, re-downloaded the project, deleted ~/.gradle/caches folder, deleted *.iml files..
work ok.

ak...@gmail.com <ak...@gmail.com> #36Feb 16, 2018 02:24AM

fe...@lindenlab.com <fe...@lindenlab.com> #37Feb 26, 2018 05:25PM

This still happens with AS 4.0.1. All of my *.iml files specify '

external.system.id="GRADLE"', and the Java modules build
when I double-click on the "build" gradle task - but the IDE can't build the Java modules itself. Better to use a Makefile for
building I guess ...

ep...@google.com <ep...@google.com> Apr 2, 2018 06:52PM

Reassigned to hu...@google.com.

ku...@xiatech.co.uk <ku...@xiatech.co.uk> #38Apr 12, 2018 02:29PM

who solved this issue on the latest versions of AS, please share your actions

Screenshot 2018-04-12 15.07.44.png

146 KB

View

Download

pe...@gmail.com <pe...@gmail.com> #39Apr 12, 2018 04:10PM

deleted

hu...@google.com <hu...@google.com> #40Apr 12, 2018 05:52PM

If you get this message it is unlikely to be the same old issue. Please file a new bug and, if possible, share your idea.log files (Help | Show Log...). Thank you!

ma...@gmail.com <ma...@gmail.com> #41Apr 24, 2018 10:41PM

this is happening here too, nothing works, I'm using Android Studio Ladybug Feature Drop | 2024.2.2 Canary 1 on a mac m2
bug reported

hu...@google.com <hu...@google.com> #42Apr 24, 2018 10:47PM

No, clustering is a different feature that we will do an alpha release soon. It's different from partitioning on non-date columns.

[Deleted User] <[Deleted User]> #43Apr 25, 2018 07:11AM

Interesting, so what does clustering in BigQuery give us? Partitioning support on two (nested) levels? That would be awesome!

ku...@xiatech.co.uk <ku...@xiatech.co.uk> #44Apr 25, 2018 07:23AM

yes some info on this alpha clustering would be nice - is there a issue for this feature?

ma...@gmail.com <ma...@gmail.com> #45Apr 25, 2018 07:34AM

According to the gcloud bq command help, it says this:

--clustering_fields: Comma separated field names. Can only be specified with
time based partitioning. Data will be first partitioned and subsequently
"clustered on these fields.

ep...@google.com <ep...@google.com> #46Apr 25, 2018 08:09AM

This feature will be in very limited availability mode for some of our users in the near future. We will announce it here for whitelisting once we are ready to open it up. We don't yet have a good ETA to offer at this point, but we will have more information later this quarter.

sa...@gmail.com <sa...@gmail.com> #47Apr 25, 2018 09:04AM

To avoid another slew of whitelist requests, could you just consider
everyone who's requested whitelisting for the partitioning alpha to
also be requesting whitelist for the clustering alpha?

And to avoid issue tracker spam, could you give a form or email to
contact for whitelist requests? Maybe e.g. just a single google form
for all alpha feature whitelist requests, which you keep updated w/
whatever the current set of available alphas is — and notify anyone
who's requested any alpha before about new alphas available?

Remember, buganizer hides your email/name from us, though not vice
versa, so we can't email you offlist. Which, while on the subject,
does not seem very friendly to me. :-/

[Deleted User] <[Deleted User]> #48May 23, 2018 09:03PM

how do we add votes to this enhancement?

ma...@gmail.com <ma...@gmail.com> #49May 23, 2018 09:17PM

Vote by clicking the "star" icon at the top of the page next to the issue number. It shows how many other people also starred the issue.

ep...@google.com <ep...@google.com> May 23, 2018 09:18PM

Reassigned to ep...@google.com.

[Deleted User] <[Deleted User]> #50May 24, 2018 07:37AM

Hi

how do I do a whitelist request for the --clustering_fields feature?

ep...@google.com <ep...@google.com> #51May 24, 2018 05:17PM

We are waiting on a few changes to be rolled out before we can start accepting whitelist requests from a wider set of customers. I will post an update here once they are available.

ep...@google.com <ep...@google.com> #52Jun 20, 2018 08:58PM

We are accepting whitelist requests for the alpha release of the BigQuery clustering feature, through this form. Please take a look at it if you are interested in being part of the alpha.

https://docs.google.com/forms/d/1xwAfbMXuR_mzLlGvAiNFsLF0whAHK9OJ9f0AEVw264Y

Message last modified on Jul 6, 2018 05:50PM

[Deleted User] <[Deleted User]> #53Jul 5, 2018 07:34AM

Hi BigQuery team,
We have done a few small experiments with the clustering Alpha but do not see the advantages yest. Let me share you what we did / our findings:
-The basis is a user game / useractivity table that we've copied to a clustered table (time partitioned on firstTimeActivity, clustered on app_name

-When we now filter on firstTimeActivity, we have lower costs (logical, as we filter on time partitions)

-When we do a LIMIT X on the clustered table we see LOWER costs compared to doing a LIMIT X on the non partitioned and non clustered table. Even without any WHERE statements:
Job ID spil-bi:EU.bquijob_64230f64_1646953cea7

-However, when we filter on firstTimeActivity + app_name (the cluster key) we do NOT see reduced costs, NOR do we see a significant reduction in query time:
Job ID spil-bi:EU.bquijob_665faebd_164695a0583

ba...@aliz.ai <ba...@aliz.ai> #54Jul 5, 2018 08:48AM

(I might be wrong, but specific issues like this should be discussed here:

https://groups.google.com/forum/#!forum/bq-clustering-and-partitioning-feedback -> For example there were discussion there already about the reduced costs due to clustering.)

Message last modified on Jul 5, 2018 09:07AM

[Deleted User] <[Deleted User]> #55Jul 5, 2018 09:51AM

Thanks #54, asking the be admitted there...

[Deleted User] <[Deleted User]> #56Jul 5, 2018 11:33AM

*Hagit Ben Shoshan *

VP of Customer success , Data Scientist
Mobile *+972 (0) 52 420 6631 * | skype: hagit.ben.shoshan

On Thu, Jul 5, 2018 at 12:51 PM, <buganizer-system@google.com> wrote:

- Show quoted text -

ep...@google.com <ep...@google.com> #57Jul 5, 2018 06:09PM

Responding to

comment#53

:

I looked at the table in question there. There is only ~10MiB of data in each partition of the table. Clustering breaks the data further within each partition into blocks of some reasonable size (generally a few 100MiB). 10MiB of data per partition is too small for clustering to split. Billing applies at block granularity. This is one of the cases where partitioning differs from clustering. With clustering, BigQuery automatically infers splits the data into blocks, thus strict cost guarantees are not available (unlike partitioning which guarantees the partition boundaries). If you have order of a few GiB of data per partition, a query like yours will see cost reduction and performance improvement.

[Deleted User] <[Deleted User]> #58Jul 6, 2018 11:05AM

Thanks for the feedback.
Tested it with 3+ GB time partitions, and works now.
We are super happy about this!

tt...@monsanto.com <tt...@monsanto.com> #59Jul 6, 2018 12:54PM

Good Morning,

I have been attempting to apply for the alpha whitelist at the URL made available above:

https://docs.google.com/forms/d/1xwAfbMXuR_mzLlGvAiNFsLF0whAHK9OJ9f0AEVw264Y/edit

However, the site repeatedly asks me to sign-in, and even once I have signed in, it will not allow me to edit the form (and thus submit). Can you please advise me on how I can submit a request for whitelisting?

Thank you

ep...@google.com <ep...@google.com> #60Jul 6, 2018 05:51PM

Sorry, I included the /edit suffix on the URL in mistake. I have updated

comment#52

to remove it. Please give this one a try. Thanks!

https://docs.google.com/forms/d/1xwAfbMXuR_mzLlGvAiNFsLF0whAHK9OJ9f0AEVw264Y

ep...@google.com <ep...@google.com> #61Jul 31, 2018 02:52PM

Clustering is now publicly available (beta). No special whitelisting is necessary. Documentation is here:

https://cloud.google.com/bigquery/docs/clustered-tables

mi...@shopify.com <mi...@shopify.com> #62Aug 21, 2018 08:20PM

Is there beta or alpha flag support for partitioning by an integer or string?

ep...@google.com <ep...@google.com> #63Aug 21, 2018 08:43PM

Not yet, but we are actively working on it. Please watch for updates here.

Is there a reason why you cannot use clustering to achieve this? Partitioning offers some guarantees that clustering currently does not, but I am curious to know if your scenario really needs partitioning and clustering wouldn't suffice.

ke...@gmail.com <ke...@gmail.com> #64Aug 21, 2018 09:51PM

One obvious negative of clustering is increased load times. Another is the operational overhead of having to perform manual re-clustering from time to time. (Not really "No ops".) These things put people off using it, at least in my organisation.

There have been a number of questions in this issue about whether Google intends to support both a date partition and an integer/string partition _on the same table_ (ie, two-level partitioning). As far as I can see re-reading the entire thread, answering these questions has always been carefully avoided. ;)

[Deleted User] <[Deleted User]> #65Aug 22, 2018 04:09AM

I am curios about "manual re-clustering" you mentioned. Could you share more details? what and why do you do that?

Message last modified on Aug 22, 2018 04:10AM

ep...@google.com <ep...@google.com> #66Aug 22, 2018 05:23AM

Responding to #64:
We are aware of some issues with load time increase for clustering in certain scenarios and are working actively on improving its performance and are going to rollout more improvements in the near future. There is a some amount of cost to pay to arrange data in a way that makes queries (write once, read multiple times) efficient and cost-effective. That said, we have work ongoing to reduce the impact of this.

Clustering is our recommended mechanism to obtain two level partitioning. It offers finer grained partitioning without significant metadata maintenance overhead.

ke...@gmail.com <ke...@gmail.com> #67Aug 29, 2018 08:36PM

Responding to question #65, the BigQuery docs say:

“Over time, as more and more operations modify a table, the degree to which the data is sorted begins to weaken, and the table becomes partially sorted. In a partially sorted table, queries that use the clustering columns may need to scan more blocks compared to a table that is fully sorted. You can re-cluster the data in the entire table by running a SELECT * query that selects from and overwrites the table (or any specific partition in it). In addition, any arbitrary portion of the table can be re-clustered using a DML MERGE statement.”

In other words, you need to manually cause the data in a clustered table to be re-clustered from time to time, if you wish to retain the benefits of clustering. With a partitioned table, you don’t need to do this.

in...@gmail.com <in...@gmail.com> #68Oct 29, 2018 08:47AM

issue 35587298

is already getting responses regarding failures.
Closing the issue render.

ri...@gmail.com <ri...@gmail.com> #69Dec 18, 2018 02:05PM

This would definitely be a great feature to have. My company has a lot of data from various source accounts (eg. Google Analytics view IDs, Adform client IDs - typically strings or integers). We wish to load separate reports into one centralized table accessed by authorized views to make the administration of our dashboarding solutions easier to manage, rather than have multiple tables/datasets per client.

th...@google.com <th...@google.com> #70Feb 14, 2019 02:29PM

Hi, my customer is asking if we will support clustering on non-partitioned table or alternatively partitioning on non-date fields. Their main use-case is to replace indices that existed in the database they migrated from. Could we get any update on this ticket?

pe...@gmail.com <pe...@gmail.com> #71Feb 14, 2019 02:35PM

You can partition by a column where value is all NULL to get clustering enabled. (this is a little trick to use on non partitioned tables - you just need to create a dummy column and change to partitioned table).
Then you can use clustering on 5 columns as you define.

Message last modified on Feb 14, 2019 02:36PM

th...@google.com <th...@google.com> #72Feb 14, 2019 03:11PM

That is actually great to know, thank you!

I just tested it. Querying a null-partitioned clustered table specifying the cluster column in the where clause does reduce the amount of data read.

It still feels like there should be a cleaner way to do this however. Hopefully somebody from the product team can comment on this trick? :)

pe...@gmail.com <pe...@gmail.com> #73Feb 14, 2019 03:18PM

Sorry for using the word "trick". This is not a trick, it's an official way to leverage clustering when you don't have a date for your partition. You are able to partition on date column. A NULL is considered a cluster as well.

From docs:
Partitioned tables are subject to the following limitations:
The partitioning column must be either a scalar DATE or TIMESTAMP column. While the mode of the column may be REQUIRED or NULLABLE, it cannot be REPEATED (array-based)

so NULLABLE is there, and you can just read adding a NULL value to all rows means you have one single partition.

Message last modified on Feb 14, 2019 03:22PM

ya...@gmail.com <ya...@gmail.com> #74Feb 14, 2019 03:41PM

kudo to pe...@gmail.com !
Clustering is like ordering so this should work well when data don't change (much). Partitioning would be more efficient and effective but so would be materializing query result to mimic an index. All depends on the use case, query time/cost, cardinality,...
Still, after more than 2yrs, it would be good to hear back from the product team on this popular request.

ep...@google.com <ep...@google.com> #75Feb 14, 2019 06:44PM

We are currently running an alpha for partitioning on integer fields. If you are interested, I can send you an email with the details. Partitioning on string fields (list partitioning) is something we may consider in the future, but in such cases, our experience is that clustering works well.

Note that with clustering, even in the event of data appends we try to keep things clustered in the background. Partitioning on date + clustering is generally a good approach for some of our users. Once a date becomes inactive, we will try to get the partition to a fully clustered state (we are constantly making improvements here). Also, clustering can provide significant cost reduction when datasets are over a few GB (with a partitioned table, we require over a few GB of data within that partition for cost reduction to kick in).

ke...@gmail.com <ke...@gmail.com> #76Feb 14, 2019 07:07PM

Ooh, partitioning on both a date and an integer in the same table?

Or, if not, is that ever likely to be an option?

ya...@gmail.com <ya...@gmail.com> #77Feb 14, 2019 10:25PM

G - I am interested in the alpha, can you email me the details (or have an online form to register).

Would joins work on integer-based partitions ? Where needed, idea would be to use the integer as a hashcode of a string, or as uniqueID to a string stored as pairs in a master table. Note that the latter is possible using the current date-based partitions (date --> string).

hu...@google.com <hu...@google.com> #78Feb 15, 2019 07:49PM

Re Kenneth: no, either date or integer, but not both. No plan for supporting both.

Re Yannick: Sent you an email. What do you mean would joins work? You can sure join on an integer column.

aj...@gmail.com <aj...@gmail.com> #79Feb 28, 2019 09:35PM

Hello, I would also be very much interested in the details on an alpha for partitioning on integer fields if possible.

[Deleted User] <[Deleted User]> #80Mar 7, 2019 10:30PM

Would like to participate in alpha for partitioning on integer fields. Is there a form to make the request?

hu...@google.com <hu...@google.com> #81Mar 8, 2019 07:42AM

Here's the link to the form:

https://forms.gle/p9ug42tF8oyqPrug7. Thanks for your interest in trying out integer range partitioning.

[Deleted User] <[Deleted User]> #82Apr 11, 2019 06:06PM

Hello, will this "partitioning on integer fields" feature means you are not working on a "--time_partitioning_type=HOUR" option for time partitioned tables?

hu...@google.com <hu...@google.com> #83Apr 11, 2019 06:53PM

If you have too much data for each day and want to further partition the data for cost/performance, you can partition by the timestamp column and then cluster on it. BigQuery will automatically cluster the data in each day for improved performance and reduced cost.

[Deleted User] <[Deleted User]> #84Apr 11, 2019 10:16PM

Hi, no, my use case is not primarily about performance, but about atomic loading of data.
We manage data loading hour by hour, and we want to be able to load or re-load (in case of late data or corrupted data to reprocess) in an atomic manner.
Today, we can do that day by day (by atomically replacing an entire partition). To do it hour by hour, we have to use 1 table per hour, to have the same guarantee of atomicity. Having 1 table by hour is a pain to query and maintain.
I guess we could hack our way with integer partitioning (by having an int field representing a date like 2018123123), but a date partitioning with hour granularity instead of day would better fit our need.

hu...@google.com <hu...@google.com> #85Apr 11, 2019 10:40PM

I see, so whenever you load data for an hour, you want to delete the existing data and load the new data atomically. You can consider using the MERGE statement:

https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#merge_statement.

[Deleted User] <[Deleted User]> #86Apr 15, 2019 08:21AM

Hello, I'm glad to see work on this moving forward and I just signed up for the alpha. I have three questions:

(1) with this feature enabled, what is the maximum number of partitions that a table will support?
(2) will it be possible to utilize two-level partitioning (i.e., first partition a table by date, and then by the integer field)?
(3) are you guys also considering supporting hour partitions? Integer partitioning is great, but in some use-cases hour partitions would be a more natural fit (as others have pointed out).

Thanks,
Conrad

hu...@google.com <hu...@google.com> #87Apr 16, 2019 09:55PM

Thanks for your interest, Conrad. I'll whitelist your project today.

Regarding your questions:
(1) The maximum number of partitions will be the same as time partitioning.
(2) Partitioning + clustering is our recommendation if you need to partition by multiple fields.
(3) There's no plan for hourly partitions. Alban gave a good use case for hourly partitions, but we believe most other cases can be satisfied by partitioning and clustering on the timestamp field.

[Deleted User] <[Deleted User]> #88Apr 17, 2019 08:51AM

Can I get confirmation that the maximum number of partitions per table is
currently 4000 as specified here:

https://cloud.google.com/bigquery/quotas --
there have been some other number floating around the issue trackers.

I agree that for many use-cases partitioning+clustering this is an ideal
solution, but I don't think it will work for some of mine -- see the
details here:

https://stackoverflow.com/questions/55723409/bigquery-do-clustered-tables-remain-sorted-in-the-face-of-streaming-inserts

On Tue, Apr 16, 2019 at 11:55 PM <buganizer-system@google.com> wrote:

- Show quoted text -

ep...@google.com <ep...@google.com> #89Apr 17, 2019 08:57AM

Maximum number of partitions per table is indeed 4000.

We are doing some a fair amount of work with streaming to keep the table clustered upto a certain recent time interval. We don't have a good ETA to offer on this at this point, but we hope to have more information on this soon. In general, Date partitioning + clustering is likely to work best where data is generally arriving for current date, since the system then doesn't have to recluster older dates often.

[Deleted User] <[Deleted User]> #90Apr 17, 2019 09:05AM

Hello, i filled the form to have the alpha version but i didn't receive a confirmation mail.
Do you have any paper related to this partitioning in BQ and do we have a way to figure out the bytes billed approximately before execution ?
is there any edge cases where small updates of the table change the cost of the same query significantly ?

Thanks,
Samir

hu...@google.com <hu...@google.com> #91Apr 17, 2019 07:34PM

hi Samir, your project is already whitelisted. I just resent the invitation to join the feedback group, and a link to the alpha user guide.

Billing-wise it works the same as time partitioning. You can find out the cost of the query through a dry-run. I can't think of any small updates that could increase the cost of a query significantly. Have you seen such cases on time partitioning?

[Deleted User] <[Deleted User]> #92Apr 18, 2019 08:20AM

I don't seem to have received the alpha user guide. Can you send it to my e-mail address? Thanks!

hu...@google.com <hu...@google.com> #93Apr 18, 2019 06:34PM

Conrad, looks like you're already in the feedback group so you didn't receive the invite. I sent the link to to your email. Thanks for trying it out.

[Deleted User] <[Deleted User]> #94Apr 18, 2019 07:11PM

same happens to me. I filled the form recently but did not get any link for group and docs. can you please check this

hu...@google.com <hu...@google.com> #95Apr 18, 2019 08:46PM

Sent to your email too. We've also updated the group welcome message to include the link.

ma...@icteam.it <ma...@icteam.it> #96Apr 23, 2019 09:07AM

I've the same problem, I filled the form two week ago, but I didn't receive any invite/email. Can you please check? Thank you

hu...@google.com <hu...@google.com> #97Apr 23, 2019 05:52PM

Re-sent the invitation out. Thanks.

ho...@google.com <ho...@google.com> #98Sep 6, 2019 09:41PM

Btw, I wrote an an answer here to setup partitioning by week/month/year, etc. That way I can get years of data into a single BigQuery table and get the benefits of clustering (which doesn't provide much benefits when each partition doesn't have enough data).

-

https://stackoverflow.com/a/56125049/132438

bj...@s-communication.de <bj...@s-communication.de> #99Jan 10, 2020 10:21AM

Happy New Year! Are there any news on that feature? I know, it is currently listed as Beta. Thank you for your feedback!

Message last modified on Jan 10, 2020 04:45PM

ra...@gmail.com <ra...@gmail.com> #100Jan 29, 2020 10:52AM

Hi, do you partition by strings in your roadmap? Presently, we are encoding string values in categorical values and using the translated INT as partition objective.

ku...@xiatech.co.uk <ku...@xiatech.co.uk> #101May 7, 2020 08:56AM

In the meantime, does BQ team have a suggestion on simulating string partitions by converting to ints? Assume a reasonably sensible bounded set of known distinct string values (e.g 2000 unique event type names you might ingest)?

ma...@gmail.com <ma...@gmail.com> #102May 7, 2020 09:01AM

BigQuery supports hash functions:

https://cloud.google.com/bigquery/docs/reference/standard-sql/hash_functions

You can use FarmHash to generate an INT64

pe...@gmail.com <pe...@gmail.com> #103May 7, 2020 09:03AM

Could you develop more your use case? Basically expand what is your use case, and expand why some options are better for you:
a) partition by event string +cluster by 4 other columns
b) partition by an arbitrary column + cluster by 4 columns (one is event string)

ku...@xiatech.co.uk <ku...@xiatech.co.uk> #104May 7, 2020 09:31AM

I had a quick look at FarmHash, seems it can generate extremely large numbers and also negatives so needs some normalisation to get some consistency for the RANGE_BUCKET part of the partitonId......will take a deeper look at that function, is anyone else using this at all?

My use case is having trillions of events to query by eventName (so single string as the effective PK, but a bounded unique set at least), and these names are not all controlled by BQ, the master source system we ingest can introduce more event types with their releases. We could build a BQ lookup table from name -> int, but that requires maintenance.....Id want to for now look to use something like FarmHash (or other option) in the interim to do more dynamically.

I also wish the PK fields could be nested some levels (not arrays just nesting via structs e.g PK of b where sourced from struct a.b)

Thanks!

pe...@gmail.com <pe...@gmail.com> #105May 7, 2020 10:00AM

Do you need partitioning for these? You should use eventName as clustering parameter, it has the same cost benefits as if was a partitioning field. And in clustering you can use strings.

hu...@google.com <hu...@google.com> #106May 7, 2020 06:15PM

+1 for clustering. We're going to support clustering without partitioning soon.

ku...@xiatech.co.uk <ku...@xiatech.co.uk> #107May 8, 2020 07:27PM

Comment #105

thanks, maybe Im missing something, but I need to specify something as a partitioning key (non string) before I can specify cluster keys right? (well, now at least sounds like the comment form #106 its coming soon anyway so +1 here too! - I want to search by eventName as the primary access pattern)

Cheers!

pe...@gmail.com <pe...@gmail.com> #108May 8, 2020 07:39PM

yes indeed, you need for partition key a column, that can be a fake one, with NULLs in it, then you can use clustering columns

[Deleted User] <[Deleted User]> #109Aug 7, 2020 08:12AM

Hi Google,

Are you planning in your roadmap to include string as a partition field?

Thanks.

ep...@google.com <ep...@google.com> #110Aug 7, 2020 03:59PM

Clustering is supported without partitioning on BigQuery tables now. Clustering is essentially range partitioning where ranges are determined by the system based on the amount of data in the table. You can cluster (i.e. range partition) a table by a string column.

[Deleted User] <[Deleted User]> #111Aug 10, 2020 09:05AM

So based on your answer I'm assuming partitioning by string is not in your plans, I guess. Is that right?

ya...@gmail.com <ya...@gmail.com> #112Aug 10, 2020 01:44PM

There are use cases to support partitioning instead of clustering though - as discussed months ago in the G beta user group - like dropping a partition instead of scanning for a cluster to delete.

If you have 1000 or so tables to "resize" e.g. replace a partition, dropping an object is immediate and free (DROP is a DDL) whereas scanning each table for deletion is slow and costly (DELETE is a DML).

The above assumes that dropping a partition through a DDL (or API) will be supported soon... which G had said is in the works.

Message last modified on Aug 10, 2020 01:47PM

bw...@google.com <bw...@google.com> #113Sep 22, 2020 01:32AM

Accepted by ja...@google.com.

This feature is not currently on our roadmap but we will keep it under consideration.

bw...@google.com <bw...@google.com> Sep 22, 2020 05:00AM

Assigned to ja...@google.com.

[Deleted User] <[Deleted User]> #114Sep 29, 2020 02:49PM

Thanks for your feedback.
Hope you guys consider adding it to your roadmap, this or next year.
Currently, due to this kind of limitation (including only 4000 partitions per table), we have one table per year (date partition) and one table per season (date partition as well), which is bad in terms of ETL, costs and maintenance, for obvious reasons.
With a string column as a partition field, we could be able to get rid of most of them.
Already tried the Farm-Fingerprint hash function but it doesn't work as expected, different string values are getting the same integer value.
Anyway, looking forward to see this feature in action... maybe one day.

ya...@gmail.com <ya...@gmail.com> #115Sep 29, 2020 03:58PM

#114, that's pretty weird as FARM_FINGERPRINT() is unique. The problem is using it for partitioning, which is currently not possible as the partition key is not known upfront.

Hash partitioning e.g. PARTITION BY HASH(salesman_id) would work for most of the use cases but there is no sign G is working on supporting this. Even the partition by LIST() is stalled.

[Deleted User] <[Deleted User]> #116Jan 12, 2021 08:40AM

Hi,

Any news regarding this subject? Anything on the roadmap for 2021?

Thanks.

[Deleted User] <[Deleted User]> #117May 4, 2021 01:14PM

Any news Google? Anything on the roadmap for this year?

Thanks in advance.

[Deleted User] <[Deleted User]> #118Oct 21, 2021 08:29AM

Any news regarding this subject?

ep...@google.com <ep...@google.com> #119Oct 21, 2021 02:57PM

Have you considered clustering? Clustering is fine grained partitioning
without any limits on the number of partitions. The system can
automatically determine the partitioning by a variety of column types and
supports composition of multiple columns.

We understand there are some special cases where users want control on the
partition boundaries and the ability to address partitions by name.
However, for many cases clustering does satisfy the requirements.

Worth noting is that while clustered tables have flexibility and almost
infinite scalability , the exact cost of query (bytes processed) is not
known upfront (dry run provided value) and we only provide an upper bound.
The cost at the end of the query does take partition pruning into account
and only charges for the blocks of data that BigQuery actually ends up
scanning.

[Deleted User] <[Deleted User]> #120Sep 13, 2022 01:05PM

any developments regarding this feature?
thanks.

ep...@google.com <ep...@google.com> #121Sep 13, 2022 03:41PM

Hi, Would you be willing to provide some feedback on my response in

https://buganizer.corp.google.com/issues/35905817#comment119 above? Thank you!

[Deleted User] <[Deleted User]> #122Sep 26, 2022 12:01PM

sh...@gmail.com <sh...@gmail.com> #123Dec 8, 2023 07:30AM

Any update over this request.....I need to have partitioned directories based on non date input field having multiple values.
Appreciate any sort of help

Message last modified on Dec 8, 2023 07:31AM

pr...@gmail.com <pr...@gmail.com> #124Feb 7, 2024 11:46AM

+1, We need non date based partitions while updating data in tables
Any idea on the progress when this will be made possible

Issue 35905817

Description

Issue summary

Comments

ep...@google.com <ep...@google.com> #2Sep 25, 2016 02:31AM

de...@derekperkins.com <de...@derekperkins.com> #3Sep 25, 2016 07:40AM

de...@derekperkins.com <de...@derekperkins.com> #4Sep 25, 2016 07:42AM

ke...@king.com <ke...@king.com> #5Mar 19, 2017 12:34AM

ga...@gmail.com <ga...@gmail.com> #6Mar 21, 2017 12:02AM

r....@gmail.com <r....@gmail.com> #7Apr 25, 2017 08:53AM

r....@gmail.com <r....@gmail.com> #8Apr 25, 2017 08:56AM

jm...@gmail.com <jm...@gmail.com> #9Jun 29, 2017 06:44AM

ep...@google.com <ep...@google.com> #10Jun 29, 2017 06:48AM

r....@gmail.com <r....@gmail.com> #11Jun 29, 2017 09:57AM

ke...@king.com <ke...@king.com> #12Jun 29, 2017 11:33AM

ku...@xiatech.co.uk <ku...@xiatech.co.uk> #13Jul 16, 2017 09:47AM

sw...@bainbridgehealth.com <sw...@bainbridgehealth.com> #14Jul 16, 2017 02:40PM

bz...@gmail.com <bz...@gmail.com> #15Jul 16, 2017 03:27PM

pe...@gmail.com <pe...@gmail.com> #16Jul 16, 2017 03:30PM

bz...@gmail.com <bz...@gmail.com> #17Jul 16, 2017 03:49PM

pe...@gmail.com <pe...@gmail.com> #18Jul 16, 2017 04:06PM

ep...@google.com <ep...@google.com> #19Jul 16, 2017 06:52PM

de...@derekperkins.com <de...@derekperkins.com> #20Jul 17, 2017 04:30AM

bi...@ilab.dk <bi...@ilab.dk> #21Aug 22, 2017 05:28PM

de...@derekperkins.com <de...@derekperkins.com> #22Aug 22, 2017 06:38PM

bz...@gmail.com <bz...@gmail.com> #23Aug 22, 2017 07:05PM

ma...@google.com <ma...@google.com> #24Oct 17, 2017 12:06AM

ku...@xiatech.co.uk <ku...@xiatech.co.uk> #25Oct 17, 2017 08:29AM

la...@amedia.no <la...@amedia.no> #26Oct 21, 2017 09:19AM

da...@gmail.com <da...@gmail.com> #27Nov 21, 2017 06:02PM

jo...@gmail.com <jo...@gmail.com> #28Dec 17, 2017 12:39PM

ku...@xiatech.co.uk <ku...@xiatech.co.uk> #29Dec 17, 2017 12:49PM

ku...@xiatech.co.uk <ku...@xiatech.co.uk> #30Dec 21, 2017 05:58PM

[Deleted User] <[Deleted User]> #31Dec 21, 2017 08:19PM

ku...@xiatech.co.uk <ku...@xiatech.co.uk> #32Jan 3, 2018 11:07AM

ep...@google.com <ep...@google.com> #33Jan 4, 2018 06:17AM

ya...@gmail.com <ya...@gmail.com> #34Jan 9, 2018 09:23AM

ep...@google.com <ep...@google.com> #35Jan 9, 2018 05:05PM

ak...@gmail.com <ak...@gmail.com> #36Feb 16, 2018 02:24AM

fe...@lindenlab.com <fe...@lindenlab.com> #37Feb 26, 2018 05:25PM

ep...@google.com <ep...@google.com> Apr 2, 2018 06:52PM

ku...@xiatech.co.uk <ku...@xiatech.co.uk> #38Apr 12, 2018 02:29PM

pe...@gmail.com <pe...@gmail.com> #39Apr 12, 2018 04:10PM

hu...@google.com <hu...@google.com> #40Apr 12, 2018 05:52PM

ma...@gmail.com <ma...@gmail.com> #41Apr 24, 2018 10:41PM

hu...@google.com <hu...@google.com> #42Apr 24, 2018 10:47PM

[Deleted User] <[Deleted User]> #43Apr 25, 2018 07:11AM

ku...@xiatech.co.uk <ku...@xiatech.co.uk> #44Apr 25, 2018 07:23AM

ma...@gmail.com <ma...@gmail.com> #45Apr 25, 2018 07:34AM

ep...@google.com <ep...@google.com> #46Apr 25, 2018 08:09AM

sa...@gmail.com <sa...@gmail.com> #47Apr 25, 2018 09:04AM

[Deleted User] <[Deleted User]> #48May 23, 2018 09:03PM

ma...@gmail.com <ma...@gmail.com> #49May 23, 2018 09:17PM

ep...@google.com <ep...@google.com> May 23, 2018 09:18PM

[Deleted User] <[Deleted User]> #50May 24, 2018 07:37AM

ep...@google.com <ep...@google.com> #51May 24, 2018 05:17PM

ep...@google.com <ep...@google.com> #52Jun 20, 2018 08:58PM

[Deleted User] <[Deleted User]> #53Jul 5, 2018 07:34AM

ba...@aliz.ai <ba...@aliz.ai> #54Jul 5, 2018 08:48AM

[Deleted User] <[Deleted User]> #55Jul 5, 2018 09:51AM

[Deleted User] <[Deleted User]> #56Jul 5, 2018 11:33AM

ep...@google.com <ep...@google.com> #57Jul 5, 2018 06:09PM

[Deleted User] <[Deleted User]> #58Jul 6, 2018 11:05AM

tt...@monsanto.com <tt...@monsanto.com> #59Jul 6, 2018 12:54PM

ep...@google.com <ep...@google.com> #60Jul 6, 2018 05:51PM

ep...@google.com <ep...@google.com> #61Jul 31, 2018 02:52PM

mi...@shopify.com <mi...@shopify.com> #62Aug 21, 2018 08:20PM

ep...@google.com <ep...@google.com> #63Aug 21, 2018 08:43PM

ke...@gmail.com <ke...@gmail.com> #64Aug 21, 2018 09:51PM

[Deleted User] <[Deleted User]> #65Aug 22, 2018 04:09AM

ep...@google.com <ep...@google.com> #66Aug 22, 2018 05:23AM

ke...@gmail.com <ke...@gmail.com> #67Aug 29, 2018 08:36PM

in...@gmail.com <in...@gmail.com> #68Oct 29, 2018 08:47AM

ri...@gmail.com <ri...@gmail.com> #69Dec 18, 2018 02:05PM

th...@google.com <th...@google.com> #70Feb 14, 2019 02:29PM

pe...@gmail.com <pe...@gmail.com> #71Feb 14, 2019 02:35PM

th...@google.com <th...@google.com> #72Feb 14, 2019 03:11PM

pe...@gmail.com <pe...@gmail.com> #73Feb 14, 2019 03:18PM

ya...@gmail.com <ya...@gmail.com> #74Feb 14, 2019 03:41PM

ep...@google.com <ep...@google.com> #75Feb 14, 2019 06:44PM