Fixed
Status Update
Comments
[Deleted User] <[Deleted User]> #2
Hey,
Same API error/question here.
We would like to try out the experimental feature as well.
a....@mediameeting.fr <a....@mediameeting.fr> #3
We need it too... Did anyone find out how we can get it? We are blocked with the "Field based partitioning support is not yet available for this project" API error...
ep...@google.com <ep...@google.com> #4
We are waiting on a few approvals to start alpha for this feature shortly.
Please watch this bug for more updates.
r....@gmail.com <r....@gmail.com> #5
That is great to hear!
I have one important question though regarding this new feature. Is it going to be implemented in such a way that we have the option to:
1) Select any column / field for partitioning BUT ALSO KEEP the time based partitioning on top of that?
2) OR is it going to be implemented in such a way that we can choose to partition on ONE FIELD ONLY, either being time or something else?
We need scenario 1 :)!
gr...@gmail.com <gr...@gmail.com> #6
We are currently looking into creating 90,000 different tables. Getting Scenario 1 as described above would allow us to use just one.
ep...@google.com <ep...@google.com> #7
We are currently doing a controlled alpha release for date/timestamp column based partitioning. Development is actively underway on scenario (1) that you outlined above. I don't have a good ETA to provide at this point, unfortunately.
r....@gmail.com <r....@gmail.com> #8
I do not understand, as we already have time-based partitioning?
I agree this one is a partial duplicate of https://issuetracker.google.com/issues/35905817
HOWEVER, that one mentions partitioning on a field other than datetime.
Scenario 1, which I describe in my previous comment, would mean keeping the datetime partitioning and, on top of that, having the possibility to partition on an additional field (or even fields).
Like you can do in a 'traditional RDBMS' that supports partitioning on multiple columns...
de...@derekperkins.com <de...@derekperkins.com> #9
I believe that refers to partitioning on a user date field as opposed to the hardcoded `_PARTITIONTIME`
ep...@google.com <ep...@google.com> #10
Correct, we are offering partitioning on a user-specified timestamp or date field at this point. After that, we plan to support multi-level partitioning (feature in development), where the first level is a date/timestamp and the second level is an arbitrary integer column.
ke...@gmail.com <ke...@gmail.com> #11
Only arbitrary integers (i.e., not strings)?
r....@gmail.com <r....@gmail.com> #12
I second that last question: why not strings? With int only, I would not want to call it real 'multi-level partitioning'...
gr...@gmail.com <gr...@gmail.com> #13
You can encode your strings as ints before loading the data. But then again, having BigQuery do it behind the scenes would be much more convenient.
ep...@google.com <ep...@google.com> #14
Sorry, the inclusion of the word 'integer' in #10 was unintentional. We are planning on supporting strings here as well.
du...@gmail.com <du...@gmail.com> #15
Hoping to try out the experimental feature as well!
[Deleted User] <[Deleted User]> #16
looking forward to this feature!
[Deleted User] <[Deleted User]> #17
Joining the party... Is there a rough ETA for the alpha of user-specified date/time partitioning support? Thanks.
ep...@google.com <ep...@google.com> #18
Date/timestamp field based partitioning is in alpha. We are happy to add you to the whitelist if you are interested. It has some caveats specified in the documentation (will be shared with you separately). If you are interested, we will need the project number(s) to add to the whitelist.
[Deleted User] <[Deleted User]> #19
Who can we privately message our project number to if we're interested in being whitelisted?
ep...@google.com <ep...@google.com> #20
If you are fine with sharing the project number over email, I can send you an email and you can respond to it with your project number.
pe...@gmail.com <pe...@gmail.com> #21
Send me one as well; I am also a GDE on Cloud.
va...@doit.com <va...@doit.com> #22
I would be interested as well to participate in alpha. My project number is #276982747399
[Deleted User] <[Deleted User]> #23
I'm fine with sending it over email! I'm at michael.raimi@blueapron.com
ya...@gmail.com <ya...@gmail.com> #24
It seems the API already supports date-based partitioning under configuration.load.timePartitioning.field.
See https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query.tableDefinitions
I have not tested it, but the request is still open for any field.
ep...@google.com <ep...@google.com> #25
The API supports it, but we are doing it as a controlled alpha. Server side support is enabled only for whitelisted customers. Thanks!
sz...@book.com <sz...@book.com> #26
I would like to be added to the alpha list. Mail me at - szmehta@book.com
[Deleted User] <[Deleted User]> #27
Please whitelist our project; #753193720196
[Deleted User] <[Deleted User]> #28
Please add our project #467018477224 to the whitelist. Thanks!
va...@doit.com <va...@doit.com> #29
ep...@google.com <ep...@google.com> #30
Everyone who has requested it so far has been whitelisted. Please share your feedback, if any, about issues you may face with the feature. Thanks!
[Deleted User] <[Deleted User]> #31
Just tried it and it looks promising. Could you share the documentation with us? Mail: cr@17.media
Thanks!
pe...@aliz.ai <pe...@aliz.ai> #32
Please add project #171413420977
be...@gmail.com <be...@gmail.com> #33
Please add project number 359749582022, project ID sk-data-platform. Also, where do we find the documentation?
Thanks!
[Deleted User] <[Deleted User]> #34
It appears that the bq CLI doesn't yet support '--time_partitioning_field' with a schema inferred from Avro data file(s). If that's the case, is there a generic converter from avsc (Avro JSON) to bq JSON?
Also, the timestamp field must be a top-level field at this time. Will the beta support a nested timestamp field? Thanks.
ep...@google.com <ep...@google.com> #35
Regarding Avro load, --time_partitioning_field support seems like it should work. We tested it on bq version 2.0.27.
Can you please share the command line that did not work? Also, what version of the bq command line tool are you using?
At the moment we don't have immediate plans to support partitioning on nested fields, but we will consider that based on alpha feedback.
[Deleted User] <[Deleted User]> #36
> bq version
This is BigQuery CLI 2.0.27
> bq load --autodetect --source_format=AVRO --time_partitioning_field=createdTimeMs xxx.yyy gs://xxx/xxx.avro
BigQuery error in load operation: Table with field based partitioning must have a schema.
Without --time_partitioning_field, the load succeeds.
ep...@google.com <ep...@google.com> #37
Thanks for reporting the issue. It looks like this is an issue with the interaction between autodetect and column-based partitioning. We will try to fix it as soon as possible. In the meantime, specifying a schema explicitly should work.
ya...@gmail.com <ya...@gmail.com> #38
Not sure anymore where to post this (see my comment https://issuetracker.google.com/issues/67861929#comment3 ), but this relates specifically to custom date-field partitioning:
We receive large daily files with dates in YYYYMMDD format. Even with this request, we won't be able to partition by the date field at load time without prepping the data beforehand (Dataprep would be overkill). The alternative is to load all dates as INTEGER or STRING and query-parse into a new set of tables, hopefully with this custom date-field partitioning. The latter is not only very cumbersome but also expensive, since all data must be scanned from staging to final.
Hence: along with allowing a custom date field to be used for partitioning, the request is to be able to specify a custom date pattern.
ep...@google.com <ep...@google.com> #39
Load doesn't do this, but it is possible with a query. Assuming these files are on GCS, you can do the following:
1. Define an external table over the files in GCS.
2. External tables have a pseudo column _FILE_NAME for each row that contains the path to the file from which the row comes. Using REGEXP_EXTRACT you can extract the date part of the path.
3. Run an INSERT DML statement that selects from that external table and inserts into the partitioned table. Assuming it is a pseudo-column (_PARTITIONTIME) partitioned table, the INSERT DML would look something like:
INSERT partitioned_table (_PARTITIONTIME, field1, field2) SELECT REGEXP_EXTRACT(_FILE_NAME, ...), field1, field2 from external_table
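[Editor's note: a minimal sketch of that INSERT, assuming (hypothetically) that files land under paths like gs://bucket/dt=20170312/file.csv and that the external table and column names below exist; the actual REGEXP_EXTRACT pattern depends on your path layout.]
INSERT partitioned_table (_PARTITIONTIME, field1, field2)
SELECT
  -- extract the 8-digit date from the file path and convert it to a timestamp
  TIMESTAMP(PARSE_DATE('%Y%m%d', REGEXP_EXTRACT(_FILE_NAME, r'dt=(\d{8})'))),
  field1,
  field2
FROM external_table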
ya...@gmail.com <ya...@gmail.com> #40
Sorry, I meant the YYYYMMDD dates are part of the file's content, not the file name. We cannot change the date format in the source; it is hard-coded in a legacy system, and we have a lot to convert. Ideally load would parse dates using a custom pattern or fully support ISO 8601 (where both YYYY-MM-DD and YYYYMMDD are accepted).
In addition, it would be great to ignore the date-specific null marker 00000000 (I have logged a separate request).
A data-prepping script is not bullet-proof and is slow (regexing a date). Currently thinking of using a dataframe to load, transform, and stream to BQ, but that is very slow and memory-hungry (and we need to provision a VM).
ep...@google.com <ep...@google.com> #41
If they are in some column inside the file, couldn't you still use the INSERT DML mechanism that was suggested above? With that approach, you won't need to run data prep; everything happens inside the INSERT statement.
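[Editor's note: a hedged sketch of what that could look like, assuming the YYYYMMDD value arrives in a string column of the external table (here called raw_date, a made-up name) and the target is partitioned on a DATE column event_date (also hypothetical).]
INSERT partitioned_table (event_date, field1)
SELECT
  PARSE_DATE('%Y%m%d', raw_date),  -- converts e.g. '20171231' to DATE '2017-12-31'
  field1
FROM external_table
WHERE raw_date != '00000000'       -- skip the date-specific null marker mentioned in #40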
ya...@gmail.com <ya...@gmail.com> #42
Agreed, but more as a workaround and at a cost, as this would scan the entire table. Most tables we get have at least one date column, and we have 2 TB of daily data to load.
It would be so much easier - for users - to have custom date formats supported, or at least have the following supported:
YYYY-[M]M-[D]D --> today
YYYY[-]MM[-]DD --> new, and standard ISO 8601
Both could be accommodated.
ep...@google.com <ep...@google.com> #43
There is no separate table scan. It would be a single scan of each input file (the ones you use in your external table definition), with rows inserted directly into the necessary partition of the partitioned table. Performing a load also scans those input files.
ya...@gmail.com <ya...@gmail.com> #44
Thanks - hadn't thought about that. Good news on the cost aspect, but please do leave this open, as it would be more straightforward if the date format could be specified or '-' could be ignored.
Will use the suggested workaround in the meantime!
ke...@gmail.com <ke...@gmail.com> #45
This is also discussed here:
https://stackoverflow.com/questions/33849969/custom-date-format-for-loading-data-into-bigquery-using-bq
https://stackoverflow.com/questions/44165737/how-to-load-an-integer-value-in-an-avro-file-to-a-date-column-in-bigquery
Are there any plans - either using the 'excuse' of the column-based time partitioning rollout to enrich functionality in this area, or simply because it's a generally desirable feature - to enhance support for parsing different date formats at load time, and avoid the need to post-process datasets whose date values don't happen to match BigQuery's preferred format?
ya...@gmail.com <ya...@gmail.com> #46
I am actually confused by #43.
Scanning = querying. As per https://cloud.google.com/bigquery/pricing, loading is free, but querying an external table to insert into a partitioned table would not be (see INSERT under https://cloud.google.com/bigquery/docs/reference/standard-sql/data-manipulation-language).
If so, in addition to the boilerplate, this is very costly just to convert dates.
pe...@gmail.com <pe...@gmail.com> #47
I tried to understand your problem; as I read it, there are multiple levels of significance/cost in your thinking:
1. You want loading into partitions to be FREE (as load is free).
2. #43 mentioned that it would be a single file scan (it's not free, but it's only one scan).
3. For the sake of example, a third case would be multiple scans, probably using multiple queries.
You didn't explain what "costly" means for you ($10 for 2 TB). If you have a single file scan and pay for that one scan, would you consider that "costly" given 2 TB of daily data?
ya...@gmail.com <ya...@gmail.com> #48
Querying from an external table and inserting into a table is not much different from loading into staging (with dates as STRING) and inserting parsed dates into a new set of tables. The cost is the same.
Sure, the former is one scan vs. two for the latter, but as pointed out by others, either way could be avoided with support for a custom date format or a looser rule ('-' in dates optional).
In addition, we use Airflow and I am not sure there is an operator to perform a DML statement on GCS data.
What "costly" means here (to me) is irrelevant. 2 TB is not a lot for big data; others may have bigger savings. But there are more benefits to the request. Maybe this should be logged separately.
I am just surprised that a date-pattern parameter on load was never raised before.
[Deleted User] <[Deleted User]> #49
We'd like to be whitelisted: project id is 413669492896, email gil.jaysmith@techland.pl
Cheers!
de...@derekperkins.com <de...@derekperkins.com> #50
Now that this alpha has been running for ~2 months, is there an ETA for GA release?
ge...@markovian.com <ge...@markovian.com> #51
Could you whitelist our project ID 679991979866 please? Email: georg.walther@markovian.com
[Deleted User] <[Deleted User]> #52
[Deleted User] <[Deleted User]> #53
Is there any way to overwrite a specific partition the same way it used to work with the _PARTITIONTIME decorator?
When trying to do so with --replace / 'WRITE_TRUNCATE', I get the following error:
*The table is partitioned by a field. Data can only be added to the table itself and distributed to partitions by the system.*
Without the decorator, it truncates the whole table.
ep...@google.com <ep...@google.com> #54
Regarding WRITE_TRUNCATE on a partition, we will have the ability to do this through the MERGE DML statement (https://issuetracker.google.com/35905927 ).
Depending on when that becomes available, we are exploring alternative ways to achieve this functionality.
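[Editor's note: once MERGE becomes available for these tables, a partition-replacement sketch could look like the following; the table, column, and date values are hypothetical, and this shows the general MERGE pattern rather than confirmed alpha behavior.]
MERGE dataset.partitioned_table T
USING dataset.staging_table S
ON FALSE
WHEN NOT MATCHED AND DATE(S.eventTime) = '2017-03-12' THEN
  INSERT ROW   -- insert the new rows for that day
WHEN NOT MATCHED BY SOURCE AND DATE(T.eventTime) = '2017-03-12' THEN
  DELETE       -- drop the old rows in that partition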
ep...@google.com <ep...@google.com> #55
There is no ETA for GA at this point; we tentatively hope to go beta in the next quarter.
me...@disney.com <me...@disney.com> #56
Hello, can we please get whitelisted as well?
project id: 498740506390
email: melissa.j.perkins.-nd@disney.com; caleb.jones@disney.com;
[Deleted User] <[Deleted User]> #57
Facing the issue below when trying to load a partitioned table from a source table using --time_partitioning_field.
My intention is to move data from src to dest, where src is not partitioned but has a field named 'created', and dest is partitioned by day. So I wanted to move the data partitioned based on the 'created' date.
Our project is already whitelisted for the alpha. If you have any solution to this problem, please let us know.
BigQuery error in load operation: Incompatible table partitioning specification. Expects partitioning specification interval(type:day,field:created), but input partitioning specification is interval(type:day)
The bq load command used is:
bq load --source_format=AVRO --time_partitioning_type=DAY --time_partitioning_field='created' userstore.Test_Record_Dest_P1 ~/Downloads/test_01.avro myschema.json
Any suggestions to make it work?
ma...@groovymedia.com <ma...@groovymedia.com> #58
We would love to test this feature! Project number: 998012707676, email: maxim@groovymedia.com
Thanks!
ep...@google.com <ep...@google.com> #59
Responding to: https://issuetracker.google.com/65440943#comment57
Are you using the latest version of the bq command line tool? I verified that with version 2.0.27 it works correctly.
[Deleted User] <[Deleted User]> #60
Hey guys,
We are playing with this and are really happy with the results so far. My question: because this is in alpha, it could be turned off at any time. What would happen to our existing tables in that scenario?
Thanks.
ep...@google.com <ep...@google.com> #61
We are planning on going beta sometime next quarter. At this point, we think that existing tables will be generally unaffected.
sa...@gmail.com <sa...@gmail.com> #62
+ request to be added to whitelist
ep...@google.com <ep...@google.com> #63
We'd need your project number in order to whitelist you.
sa...@gmail.com <sa...@gmail.com> #64
Email me and I'll tell you? Buganizer doesn't show me your email; it only shows mine to you.
de...@derekperkins.com <de...@derekperkins.com> #65
Is it possible to whitelist an organization so that any projects inside the organization can use this?
ep...@google.com <ep...@google.com> #66
Yes, it is possible to whitelist an entire organization.
de...@derekperkins.com <de...@derekperkins.com> #67
That's awesome. Can I get organization 496154168690 whitelisted? Thanks!
ep...@google.com <ep...@google.com> #68
Done.
ro...@storemaven.com <ro...@storemaven.com> #69
Access to field-partitioned tables behaves differently from pseudo-column (_PARTITIONTIME) partitioned ones.
We encountered an issue which led to tremendously high costs while migrating vast amounts of data using field-based partitioning.
project id: mockappstore
datasets involved: sm_migration & sm_analytics
The query below will FULL SCAN a table even though it is partitioned on eventTime field (TIMESTAMP type):
SELECT customerName,productName,testName,storeName,action,category,context,uId,userType,userIP,eventTime,clientTime,sessionId,firstSessionId,firstSessionTime,imageIndex,campaign,
ad,cId,pId,testId,storeId,expectedUserDevice,userDevice,userPlatform,userAgent,userBrowser,userOS,userBrowserVersion,iphoneModel,userRegion,testType,serverVersion,appVersion,
referer,userCountry,userCity,xFbHttpEngine,xPurpose,aString,aIndex,qString,qIndex,videoTime,userRequestedWith,editorsNotes,lastVisibleImage,maxVisibleImage,firstVisibleImage,
itemIndex,itemName,lastVisibleItem,featuredApp,galleryOrientation,entry,extraHeaders,extraUrl,browserEvents,gyro,battery,stickerGallery, videoSessionIndex,videoContinueIndex,sessionDurations,storeOS
FROM sm_migration.cust_867c775a766c4a04a346dd341e74026a
WHERE TIMESTAMP(DATE(eventTime)) = PARSE_TIMESTAMP('%Y%m%d','20170312', 'UTC')
Result: a FULL SCAN - 5.73 MB.
The equivalent table partitioned by _PARTITIONTIME will scan only the 2017-03-12 partition (expected behavior); see query:
SELECT customerName,productName,testName,storeName,action,category,context,uId,userType,userIP,eventTime,clientTime,sessionId,firstSessionId,firstSessionTime,imageIndex,campaign,
ad,cId,pId,testId,storeId,expectedUserDevice,userDevice,userPlatform,userAgent,userBrowser,userOS,userBrowserVersion,iphoneModel,userRegion,testType,serverVersion,appVersion,
referer,userCountry,userCity,xFbHttpEngine,xPurpose,aString,aIndex,qString,qIndex,videoTime,userRequestedWith,editorsNotes,lastVisibleImage,maxVisibleImage,firstVisibleImage,
itemIndex,itemName,lastVisibleItem,featuredApp,galleryOrientation,entry,extraHeaders,extraUrl,browserEvents,gyro,battery,stickerGallery, videoSessionIndex,videoContinueIndex,sessionDurations,storeOS
FROM sm_analytics.cust_p_867c775a766c4a04a346dd341e74026a
WHERE TIMESTAMP(DATE(_PARTITIONTIME)) = PARSE_TIMESTAMP('%Y%m%d','20170312', 'UTC')
Result: a SINGLE PARTITION SCAN - 335 KB (perfectly fine).
The table above, which stores 5.73 MB of data, is used to illustrate our issue. We have tables with over 800 GB of data, and the migration processes 712 days.
Instead of 712 single-partition scans (the same cost as one FULL TABLE SCAN), this causes 712 FULL SCANS on each of our 680+ tables.
Thanks
ep...@google.com <ep...@google.com> #70
Thanks for the feedback. Field-partitioned tables based on a timestamp field will behave differently for partition pruning than pseudo-column partitioned tables, due to the granularity of values stored in the column. I am not sure if you have access to the alpha documentation, but this is one of the caveats mentioned there. Here are a few things we are planning on providing soon:
1. A require_partition_filter flag on a partitioned table. Setting this to true will ensure that the system fails queries that do NOT have a filter that can be used to evaluate which partitions to scan.
2. We are improving partition pruning for field-based partitioned tables. We have identified common functions that are used on timestamp fields and support them for partition pruning. We are going to keep adding to this list (and document it). For example, since we launched the alpha, we have added support for predicates of the kind DATE(eventTime) = "2017-03-12" for pruning. If you have other common filters that you use, please list them here and we will look into adding support for them.
For filters that aren't supported for pruning, users can avoid incurring huge bills by setting the flag described in (1).
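[Editor's note: for illustration, the query from #69 rewritten with the pruning-friendly predicate mentioned in point (2) would look roughly like this; only the filter changes, and the column list is abbreviated.]
SELECT customerName, productName, eventTime
FROM sm_migration.cust_867c775a766c4a04a346dd341e74026a
WHERE DATE(eventTime) = '2017-03-12'   -- predicate form supported for pruning to a single partition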
ep...@google.com <ep...@google.com> #71
Repurposed this bug to be specific to timestamp and date field partitioning.
https://issuetracker.google.com/35905817 tracks the non-date/timestamp field partitioning.
ci...@gmail.com <ci...@gmail.com> #72
We also want to be added to the whitelist. Can you please shoot an email to cityscooper@gmail.com for the project ID? Thanks.
wo...@gmail.com <wo...@gmail.com> #73
We want to try this feature, project ID: 231841946736. Thanks.
ep...@google.com <ep...@google.com> #74
Due to the holidays, we will be able to resume whitelisting only after the new year. Apologies for any inconvenience. Thanks!
[Deleted User] <[Deleted User]> #75
Recently BigQuery added the ability to query partitioned wildcard tables. This does not appear to work with field based partitioning. Is this on the roadmap to fix? This is the error:
BigQuery error in update operation: Wildcard table over field partitioning tables is disallowed.
ep...@google.com <ep...@google.com> #76
Yes, we will be supporting this before going to beta (tentatively mid Q1).
ep...@google.com <ep...@google.com> #77
All whitelist requests have been processed.
[Deleted User] <[Deleted User]> #78
Can I please get organization 600153968971 added to the whitelist? Thanks.
[Deleted User] <[Deleted User]> #79
Hey guys, can I please get whitelisted to test this feature? projectid: 450940328522
[Deleted User] <[Deleted User]> #80
We were trying to test this and realized it's not GA. Please add 404270122387 to the whitelist. Thanks!
al...@veolia.com <al...@veolia.com> #81
Hello,
Could you add the project 192056161420 to the whitelist?
Thanks.
Regards,
Alexandre
ya...@gmail.com <ya...@gmail.com> #82
please whitelist 897478599326 and 333196740116
Thanks, Yannick
ya...@gmail.com <ya...@gmail.com> #83
Was hoping to work on that during the weekend.
Any ETA on the pending whitelistings? Thanks!
ep...@google.com <ep...@google.com> #84
All pending whitelist requests have been processed as of 11 am PST today.
ma...@gmail.com <ma...@gmail.com> #85
Please whitelist 368656566068, 1007394073692 and 464117503368.
Thank you!
ya...@gmail.com <ya...@gmail.com> #86
Forgot one - can you please whitelist 529599436172 too
Thanks.
ep...@google.com <ep...@google.com> #87
All whitelist requests so far have been processed.
ya...@gmail.com <ya...@gmail.com> #88
With the recent support for CREATE TABLE DDL, can whitelisted projects use a user-defined date field for the PARTITION BY parameter?
If not, please consider it.
el...@google.com <el...@google.com> #89
Yes, you can use it. You should be able to execute statements such as these if you are part of the column-based partitioning alpha:
CREATE TABLE my_dataset.my_table (x INT64, y DATE) PARTITION BY y;
CREATE TABLE my_dataset.my_table (x INT64, y TIMESTAMP) PARTITION BY DATE(y);
Once column-based partitioning is in beta (tentatively mid-Q1, per #76), we will update https://cloud.google.com/bigquery/docs/data-definition-language to reflect the supported syntax.
ya...@gmail.com <ya...@gmail.com> #90
That's great - however I just hit an undocumented limit:
Error: Too many partitions produced by query, allowed 500, query produces at least 1095 partitions
(Job ID: sunlit-flag-177515:bquijob_74fe0af7_1610ffe2449)
I read somewhere about a 2000-partition limit which, if enforced, would work fine for my use case - I need 5 years of daily data, partitioned to allow continued discovery and very short response times.
500 is really too low.
Can you increase that threshold or waive it?
ep...@google.com <ep...@google.com> #91
This is documented in the alpha doc that was shared with our whitelisted users. We hope to increase it to 2500. Note that this limit is different from the total number of partitions in a table (which is already 2500 today).
ya...@gmail.com <ya...@gmail.com> #92
My bad, you're right. I have worked around it by issuing one 'INSERT INTO' DML statement per year.
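[Editor's note: a minimal sketch of that workaround, assuming a hypothetical staging table and a DATE partitioning column named event_date; each statement produces at most ~366 partitions, staying under the per-query limit.]
INSERT dataset.partitioned_table (event_date, field1)
SELECT event_date, field1
FROM dataset.staging_table
WHERE event_date BETWEEN '2013-01-01' AND '2013-12-31'   -- repeat once per year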
[Deleted User] <[Deleted User]> #93
Please add two projects. Can I also please get a copy of the alpha doc?
961768382261
368656566068
[Deleted User] <[Deleted User]> #94
Can you please whitelist 187213475372.
Thanks.
ep...@google.com <ep...@google.com> #95
Whitelist requests so far have been processed.
di...@monzo.com <di...@monzo.com> #96
Can you whitelist 962019738027 as well please?
Thanks in advance!
ma...@gmail.com <ma...@gmail.com> #97
Please whitelist 961768382261 as well! thank you!
dr...@gtempaccount.com <dr...@gtempaccount.com> #98
Please whitelist 464117503368, thank you!
ad...@google.com <ad...@google.com> #99
Please whitelist 228840735359 as well and let me know when it has been processed?
Thank you
Arend
[Deleted User] <[Deleted User]> #100
Please whitelist 1057665635705 as well, thanks
ep...@google.com <ep...@google.com> #101
All whitelist requests up to (and including) #99 have been processed.
da...@gmail.com <da...@gmail.com> #102
Please whitelist project ID 57024152020. Thank you so much.
[Deleted User] <[Deleted User]> #103
Could you please whitelist project ID 1006213818196? Thanks
mi...@qlirogroup.com <mi...@qlirogroup.com> #104
We would also very much appreciate a whitelist: 369318398231, Thanks
[Deleted User] <[Deleted User]> #105
Is it too late to sign up?
These projects:
478780302124
112734977199
62448027116
438110786158
Thank you!
ta...@care2team.com <ta...@care2team.com> #106
I would very much appreciate a whitelist for our project, which is #176550870186.
Thank you!
ak...@gmail.com <ak...@gmail.com> #107
We would like to try this feature, projects: db-cm-dev and db-cm-pro. Thanks.
ma...@gmail.com <ma...@gmail.com> #108
This is in beta now: https://cloud.google.com/bigquery/docs/creating-column-partitions
You can already create these tables and query them. They've been great for us so far, although they don't currently support DML (update/delete) statements.
ep...@google.com <ep...@google.com> #109
DML support will be available shortly. I am closing this feature request. Partitioning on a non-date/timestamp field is tracked in https://issuetracker.google.com/35905817
ya...@gmail.com <ya...@gmail.com> #110
Sorry for the cross-posting - I wrote the same on the dedicated Google Group for alpha users, but posting here too just in case, since the group has had only 4 posts so far:
I have 2 (actually 252) tables which are each partitioned the same way, using an "as_of" timestamp - these are then partitioned by day (e.g. below, partition 2012-11-09).
In the query below, data gets pruned just fine.
SELECT
t2.*
FROM
`dataset.t1` join `dataset.t2` using (a, b, c)
WHERE
t1.as_of=TIMESTAMP('2012-11-09T23:59:51')
AND
t2.as_of=TIMESTAMP('2012-11-09T23:59:51')
However, when I set only one condition and use the join "for the other(s)", data is not pruned on the joined tables and a full table scan is performed on table t2:
SELECT
t2.*
FROM
`dataset.t1` join `dataset.t2` using (a, b, c, as_of)
WHERE
t1.as_of=TIMESTAMP('2012-11-09T23:59:51')
The latter is more convenient, especially if you have a lot of tables to join.
ya...@gmail.com <ya...@gmail.com> #111
I wanted to add that this is not just about convenience: the first query cannot leverage a view, whereas for the second, if pruning were supported, one could create a view:
CREATE VIEW `dataset.myview` AS
SELECT
t2.*
FROM
`dataset.t1` join `dataset.t2` using (a, b, c, as_of)
and running a
SELECT * FROM myview WHERE as_of = '2012-11-09T23:59:51'
would prune nice and clean.
Description
Is there a way to use this experimental feature? I'm currently developing a new app which would make great use of it. I'm currently getting the "Field based partitioning support is not yet available for this project" error from the API.