Status Update
Comments
ma...@mpb.com <ma...@mpb.com> #2
ri...@google.com <ri...@google.com> #3
Hi,
Can you provide more information about:
- Steps to reproduce the issue.
- If possible, can you provide sample data for reproduction? Please remove any PII.
If possible, can you also provide a screenshot of the error?
Thanks
ma...@mpb.com <ma...@mpb.com> #4
{
"insertId": "63a0f443-0000-2ad1-bbc1-f403045f7a4e@a1",
"jsonPayload": {
"context": "CDC",
"event_code": "UNSUPPORTED_EVENTS_DISCARDED",
"read_method": "",
"message": "Discarded 1180 unsupported events for BigQuery destination: 880653332314.datastream_txns_public.adjustment_adjustmentmodel, with reason code: BIGQUERY_TOO_MANY_PRIMARY_KEYS, details: Failed to create the table in BigQuery, because the source table has too many primary keys.."
},
"resource": {
"type": "
"labels": {
"resource_container": "REDACTED",
"stream_id": "sandpit-txns-to-sandpit-bq1",
"location": "europe-west2"
}
},
"timestamp": "2022-10-04T12:56:20.846705Z",
"severity": "WARNING",
"logName": "projects/REDACTED/logs/
"receiveTimestamp": "2022-10-04T12:56:21.525370213Z"
},
ma...@mpb.com <ma...@mpb.com> #5
CREATE TABLE public.adjustment_adjustmentmodel (
created_date timestamptz NOT NULL,
modified_date timestamptz NOT NULL,
id uuid NOT NULL,
adjustment_type int4 NOT NULL,
adjusted_item_id uuid NOT NULL,
line_item_id uuid NOT NULL,
transaction_reference_id uuid NOT NULL,
promotion_id uuid NULL,
CONSTRAINT adjustment_adjustmentmodel_line_item_id_key UNIQUE (line_item_id),
CONSTRAINT adjustment_adjustmentmodel_pkey PRIMARY KEY (id)
);
ma...@mpb.com <ma...@mpb.com> #6
le...@andgosystems.com <le...@andgosystems.com> #7
CASE WHEN pg_index.indisprimary IS NULL THEN $32 ELSE $33 END AS is_primary_key
but pg_index.indisprimary can be 't' or 'f', so checking only for NULL causes columns from non-primary-key indexes to be flagged as primary key columns. The line should be changed to something like:
CASE WHEN pg_index.indisprimary = 't' THEN $33 ELSE $32 END AS is_primary_key
This query is also the cause of another bug: 251216031
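For illustration, a minimal standalone query that identifies primary key columns correctly (an assumed sketch, not Datastream's actual internal query) is shown below; run against the table from #5 it returns only id, no matter how many secondary indexes exist:
-- Assumed sketch: list only the columns belonging to the table's PRIMARY KEY index.
SELECT a.attname AS pk_column
FROM pg_index i
JOIN pg_attribute a
  ON a.attrelid = i.indrelid
 AND a.attnum = ANY (i.indkey)
WHERE i.indrelid = 'public.adjustment_adjustmentmodel'::regclass
  AND i.indisprimary;  -- true only for the PRIMARY KEY index, 'f' for secondary indexes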
ri...@google.com <ri...@google.com> #8
Hi,
Just to confirm, are you following this
The log you have posted pertains to datastream_txns_public.adjustment_adjustmentmodel and your DDL points to public.adjustment_adjustmentmodel. There seems to be a mismatch between the logs and the DDL. Is this expected?
Thanks
ma...@mpb.com <ma...@mpb.com> #9
ma...@mpb.com <ma...@mpb.com> #10
ri...@google.com <ri...@google.com> #11
ri...@google.com <ri...@google.com> #12
Hi @mark.doutre,
I tried replicating your issue by following this
Prior to this, I created a table in my Cloud SQL Postgres database using the DDL you provided. See schema [1].
I created mock data using the query below:
INSERT INTO public.adjustment_adjustmentmodel (created_date,modified_date,id,adjustment_type,adjusted_item_id,line_item_id,transaction_reference_id,promotion_id) values ('2022-09-15 19:00+11:00','2022-09-16 19:00+11:00',gen_random_uuid (),1,gen_random_uuid (),gen_random_uuid (),gen_random_uuid (),gen_random_uuid ());
The Postgres data was inserted successfully, as seen in [2]. I proceeded with creating the connection profiles for both Postgres and BigQuery, as seen in the
Let me know if I missed anything on my reproduction steps so I can retry my replication based on the steps you have taken.
[1] adjustment_schema.png
[2] postgres_query_output.png
[3] created_stream.png
[4] bq_streamed_data.png
ma...@mpb.com <ma...@mpb.com> #13
When I view the schema in Datastream, I see the attached.
le...@andgosystems.com <le...@andgosystems.com> #14
Hi ri...,
To replicate the issue you need to add indexes that also reference the primary key column. For example, creating this table and trying to make a stream with it will fail with the BIGQUERY_TOO_MANY_PRIMARY_KEYS error, even though it clearly only has one primary key consisting of a single column, id.
CREATE TABLE too_many_keys_failure (
id int,
created_date timestamp,
last_modified_date timestamp,
user_id int,
facility_id int,
manager_id int,
is_available bool,
CONSTRAINT id_pk PRIMARY KEY (id) --NOTE THAT THIS IS THE ONLY PRIMARY KEY!
);
--NOTE THE NON-PRIMARY KEY INDEXES
CREATE INDEX ON too_many_keys_failure (id, user_id, facility_id, manager_id, is_available, last_modified_date);
CREATE INDEX ON too_many_keys_failure (id, user_id, facility_id, manager_id, is_available);
CREATE INDEX ON too_many_keys_failure (id, user_id, facility_id, manager_id, last_modified_date);
CREATE INDEX ON too_many_keys_failure (id, user_id, facility_id, is_available, last_modified_date);
CREATE INDEX ON too_many_keys_failure (id, user_id, facility_id);
CREATE INDEX ON too_many_keys_failure (id, facility_id);
INSERT INTO too_many_keys_failure
VALUES
(1,current_timestamp, current_timestamp, '1','1','2',false),
(2,current_timestamp, current_timestamp, '2','1',null,false),
(3,current_timestamp, current_timestamp, '3','1',null,false);
This appears to be due to how Datastream attempts to detect primary keys in its query to PostgreSQL. I believe there is a bug in the query that was written/generated, where it checks pg_index.indisprimary IS NULL instead of pg_index.indisprimary = 't'.
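To see what Postgres itself reports for the repro table above, one could run this verification sketch (assuming the table has been created as shown):
-- Assumed verification query: list the table's indexes and their indisprimary flags.
-- Only the id_pk index has indisprimary = true; the six secondary indexes are 'f',
-- not NULL, which is why an IS NULL check misclassifies their columns as primary keys.
SELECT i.indexrelid::regclass AS index_name,
       i.indisprimary
FROM pg_index i
WHERE i.indrelid = 'too_many_keys_failure'::regclass;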
The attached screenshot shows the failure when creating a new stream for the table public_too_many_keys_failure. Note that the public prefix on the table name is there because I chose the "Single dataset for all schemas" option when setting up the stream.
I've also added a second table to the stream, called public_too_this_one_works, that is identical to this table except without the non-primary-key indexes. This one is shown in the screenshot, and we can see that it wrote the 3 records.
ri...@google.com <ri...@google.com> #15
Hi @leon.verhelst,
Thank you for providing additional details. I will provide an update on my replication and findings.
ri...@google.com <ri...@google.com> #16
Hi,
I was able to replicate the issue. I reached out to the product team and created an internal issue about this. Please keep in mind that this issue has to be analyzed and considered by the product team, and I can't provide you with an ETA for it. However, you can keep track of the status by following this thread.
ri...@google.com <ri...@google.com>
ri...@google.com <ri...@google.com> #18
Hi,
I would appreciate it if you could provide your insight on this. Assuming a unique key is a must for clustering to properly load data into BigQuery, would you expect Datastream to randomly choose a unique index if one exists and use it as the primary key in BigQuery, or to look only for primary keys and fail if none exists?
Thanks
ma...@mpb.com <ma...@mpb.com> #19
I would rather see an option during destination creation where the user can specify how the data should be clustered or partitioned, if required. For instance, in my use case I want to take transactional data from Postgres and load it into BQ for analytics purposes. The destination query workloads are going to be different from the source workloads, so it would be an advantage for my use case if I could cluster the data, for example on some user ID, to assist in analysis.
le...@andgosystems.com <le...@andgosystems.com> #20
I would expect BigQuery to respect the REPLICA IDENTITY from the source tables and act similarly to PostgreSQL's rules for setting up a publication:
From:
A published table must have a “replica identity” configured in order to be able to replicate UPDATE and DELETE operations, so that appropriate rows to update or delete can be identified on the subscriber side. By default, this is the primary key, if there is one. Another unique index (with certain additional requirements) can also be set to be the replica identity. If the table does not have any suitable key, then it can be set to replica identity “full”, which means the entire row becomes the key. This, however, is very inefficient and should only be used as a fallback if no other solution is possible. If a replica identity other than “full” is set on the publisher side, a replica identity comprising the same or fewer columns must also be set on the subscriber side. See REPLICA IDENTITY for details on how to set the replica identity. If a table without a replica identity is added to a publication that replicates UPDATE or DELETE operations then subsequent UPDATE or DELETE operations will cause an error on the publisher. INSERT operations can proceed regardless of any replica identity.
A Postgres -> BigQuery replication should use the REPLICA IDENTITY that is set on the source table, which is normally determined like so:
- Use the PK if one exists
- Otherwise use a specified unique index, as per the table definition
- Otherwise use the full row
For information on how to set the replica identity see:
Finding the replica identity for a table is done as described here:
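As a concrete sketch using the table from #5 (the index name comes from its UNIQUE constraint), checking and changing the replica identity looks roughly like this:
-- Check the current replica identity:
--   'd' = default (primary key), 'i' = index, 'f' = full, 'n' = nothing
SELECT relreplident
FROM pg_class
WHERE oid = 'public.adjustment_adjustmentmodel'::regclass;
-- Point the replica identity at an existing unique index instead of the PK:
ALTER TABLE public.adjustment_adjustmentmodel
  REPLICA IDENTITY USING INDEX adjustment_adjustmentmodel_line_item_id_key;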
ed...@rentpure.com <ed...@rentpure.com> #21
ti...@pacerevenue.com <ti...@pacerevenue.com> #22
di...@gmail.com <di...@gmail.com> #23
jo...@getprizepool.com <jo...@getprizepool.com> #24
lp...@stratechcorp.com <lp...@stratechcorp.com> #25
et...@google.com <et...@google.com> #26
A fix for this bug is currently being rolled out, and should be applied to all Google Cloud regions by the end of the week (Oct. 29).
an...@trumidtech.com <an...@trumidtech.com> #27
jo...@getprizepool.com <jo...@getprizepool.com> #28
pr...@google.com <pr...@google.com> #29
Does this issue have the same solution as the following error message?
BIGQUERY_UNSUPPORTED_PRIMARY_KEY_CHANGE
lp...@stratechcorp.com <lp...@stratechcorp.com> #30
Is this a limitation of Datastream, because it does not allow generating the partitioned table?
Or am I doing something wrong?
wf...@dataskrive.com <wf...@dataskrive.com> #31
re...@ethic.com <re...@ethic.com> #32
In our case, the table being copied from Cloud SQL (MySQL) to BigQuery using the new Datastream feature does have 5 columns in its primary key. Is there a limit on the number of columns in a primary key for this to work? Not sure why this is a limitation.
- Error message details: Failed to create the table in BigQuery, because the source table has too many primary keys..
ri...@99.co <ri...@99.co> #33
Step by step to reproduce this:
1. Start a Datastream stream from PostgreSQL to BigQuery.
2. After the transfer is done and all tables are in BigQuery, pause the job.
3. Partition one of the tables.
4. Resume the job.
5. The log says:
{
"insertId": "640c1ed5-0000-20cd-8059-883d24fc7d54@a1",
"jsonPayload": {
"read_method": "",
"event_code": "UNSUPPORTED_EVENTS_DISCARDED",
"context": "CDC",
"message": "Discarded 25 unsupported events for BigQuery destination: DATASET_ID, with reason code: BIGQUERY_UNSUPPORTED_PRIMARY_KEY_CHANGE, details: Failed to write to BigQuery due to an unsupported primary key change: adding primary keys to existing tables is not supported.."
},
"resource": {
"type": "
"labels": {
"resource_container": "",
"location": "LOCATION",
"stream_id": "DATASET_ID"
}
},
"timestamp": "2022-11-16T04:40:05.318457Z",
"severity": "WARNING",
"logName": "projects/PROJECT_ID/logs/
"receiveTimestamp": "2022-11-16T04:40:06.332008985Z"
}
I checked, and the only difference is whether the destination table is partitioned or not; the clustering is the same (using the id of that table).
When I changed the destination table back to not being partitioned, it worked successfully.
ju...@gmail.com <ju...@gmail.com> #34
{
insertId: "64443c8d-0000-2756-9db0-14c14ef32a9c@a1"
jsonPayload: {
context: "CDC"
event_code: "UNSUPPORTED_EVENTS_DISCARDED"
message: "Discarded 1677 unsupported events for BigQuery destination: [my table], with reason code: BIGQUERY_TOO_MANY_PRIMARY_KEYS, details: Failed to create the table in BigQuery, because the source table has too many primary keys.."
read_method: ""
}
logName: "projects/PROJECT_ID/logs/
receiveTimestamp: "2022-11-19T00:51:48.226399021Z"
resource: {2}
severity: "WARNING"
timestamp: "2022-11-19T00:51:48.177058Z"
}
Create table statement from source Postgres Cloud SQL:
create table myschema.mytable
(
company_id bigint not null,
region_id integer not null,
day date not null,
sales numeric not null,
hits numeric not null,
constraint mytable_uniq
unique (company_id, region_id, day)
);
et...@google.com <et...@google.com> #35
Apparently there's been some regression to this issue... an updated fix is pending, and will be rolled out ASAP.
I'll update here again once the fix is in production.
[Deleted User] <[Deleted User]> #36
Hi Team, it seems I am having the same issue as #33 (this issue is blocking for me in production). Please let us know the status of this.
I am getting an error when trying to partition the destination table in BigQuery while working with Datastream.
Step by step to reproduce this:
1. Start a Datastream stream from Cloud SQL (MySQL) to BigQuery.
2. Once the stream has completed all tables in BigQuery, pause the job.
3. Partition one of the tables.
4. Resume the job.
5. I get an error log as below:
====================================================
Discarded 97 unsupported events for BigQuery destination: 833537404433.Test_Membership_1.internal_Membership, with reason code: BIGQUERY_UNSUPPORTED_PRIMARY_KEY_CHANGE, details: Failed to write to BigQuery due to an unsupported primary key change: adding primary keys to existing tables is not supported..
{
insertId: "65ad79ec-0000-24c7-a66e-14223bbf970a@a1"
jsonPayload: {
context: "CDC"
event_code: "UNSUPPORTED_EVENTS_DISCARDED"
message: "Discarded 97 unsupported events for BigQuery destination: 833537404433.Test_Membership_1.internal_Membership, with reason code: BIGQUERY_UNSUPPORTED_PRIMARY_KEY_CHANGE, details: Failed to write to BigQuery due to an unsupported primary key change: adding primary keys to existing tables is not supported.."
read_method: ""
}
logName: "projects/gcp-everwash-wh-dw/logs/
receiveTimestamp: "2022-11-22T22:08:38.620495835Z"
resource: {2}
severity: "WARNING"
timestamp: "2022-11-22T22:08:37.726075Z"
}
---------------------------------------------------------------
What did you expect to happen?
I expect to be able to partition certain tables that are being loaded into BigQuery via Datastream.
Attaching a screenshot for reference.
ev...@google.com <ev...@google.com> #37
For Postgres/BQ pairing, what are the steps needed to confirm this fix works? Will a running stream with a broken source table self-correct with the new code? A standard cleanup procedure would be very helpful.
- Does the table in question need to be removed (unchecked, saved) and added again in the source configuration?
- Does the stream need to be stopped (paused) and restarted? Deleted and recreated to pickup the new code?
- Does the destination table need to be deleted in BQ?
et...@google.com <et...@google.com> #38
The fix has been rolled out.
To recover from this error:
- If a table was already created in BigQuery it should be manually deleted
- Trigger a backfill for the table in Datastream
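For example, the stale destination table can be dropped with a BigQuery SQL statement like this (hypothetical project and dataset names):
-- Hypothetical names; drop the stale destination table before triggering the backfill.
DROP TABLE IF EXISTS `my-project.datastream_txns_public.adjustment_adjustmentmodel`;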
ni...@google.com <ni...@google.com> #39
BigQuery Product Manager here. It looks like the request here is to add partitioning to an existing BigQuery table. Unfortunately, that's not supported; you have to add partitioning to a net-new table. Technically, you can create a newly partitioned table from the result of a query [1]; however, this approach won't work for existing Datastream-sourced tables, since there wouldn't be a _CHANGE_SEQUENCE_NUMBER field, which is required to apply UPSERT operations in the correct order. So the only option is to pre-create the table with partitioning/clustering/primary keys before starting the Datastream stream, as in the DDL example below [2].
One thing to note, however, is that today partitioning may not be as effective at reducing the data scanned during background CDC apply operations, because a background merge could UPSERT any record within the base table and partition pruning isn't propagated to the background operation. Clustering, on the other hand, should still be beneficial, because the PK fields are also used as the clustering fields.
[1]
[2] CREATE TABLE `project.dataset.new_table`
(
  `Primary_key_field` INTEGER PRIMARY KEY NOT ENFORCED,
  `time_field` TIMESTAMP,
  `field1` STRING
  # Just an example above. Add the needed fields for the base table...
)
PARTITION BY
  DATE(time_field)
CLUSTER BY
  Primary_key_field # This must be an exact match of the specified primary key fields
OPTIONS(max_staleness = INTERVAL 15 MINUTE) # or whatever the desired max_staleness value is
be...@orbitremit.com <be...@orbitremit.com> #40
jo...@unity3d.com <jo...@unity3d.com> #41
et...@google.com <et...@google.com> #42
@johan.eliasson - does the table actually have more than 4 PK columns? If not, can you share the CREATE TABLE statement (you can email it to me directly instead of posting it here)?
If the table has more than 4 PK columns, then this error is currently the expected behavior, but there's a change coming to BQ which will allow more than 4 columns in the PK. I'm not able to share exact timelines for this change, but it's WIP (perhaps @nickorlove can provide more details).
ni...@google.com <ni...@google.com> #43
This is a public thread, so I'll refrain from providing exact timelines; however, please note that the limit of 4 PK columns is a known issue we are working hard to address.
I'll update this thread once more concrete details can be shared with the broader community.
[Deleted User] <[Deleted User]> #44
I encountered this issue as well; my DDL is:
CREATE TABLE public.spoon_mst (
  spoon_code varchar(32) NOT NULL,
  qr_code varchar(32) NULL,
  scan_date timestamp(6) NULL,
  product_name varchar(128) NULL,
  weight varchar(16) NULL,
  mfg_date timestamp(6) NULL,
  exp_date timestamp(6) NULL,
  is_active bool NOT NULL DEFAULT true,
  status varchar(255) NULL DEFAULT 'UNUSED'::character varying,
  code_length int4 NULL,
  created_date timestamp NULL DEFAULT now(),
  description varchar(255) NULL,
  is_check bool NULL DEFAULT false,
  updated_date timestamp NULL,
  ext_id varchar(255) NULL,
  qr_manufacture_date timestamp NULL,
  is_synced bool NULL DEFAULT true,
  "version" int4 NOT NULL DEFAULT 0,
  CONSTRAINT spoon_mst_pkey PRIMARY KEY (spoon_code)
);
My primary key is a random string generated by an algorithm. When I sync other tables whose primary key is numeric (id), they work well. Does using a string primary key cause this issue?
ni...@google.com <ni...@google.com> #45
It looks like the issue with your DDL is around syntax, and that the primary key does not match the table's clustering key. An example DDL to create a table to be used with Datastream would look like this:
CREATE TABLE customers (
  ID INT64 PRIMARY KEY NOT ENFORCED,
  NAME STRING,
  SALARY INT64
) CLUSTER BY ID;
ni...@google.com <ni...@google.com> #46
FYI, my earlier comment with a suggested DDL was written in the frame of mind of running the DDL within BigQuery, to create a BQ table that would then be used as the destination for Datastream replication.
If your question was more about the syntax of running a DDL on the source database, please ignore it.
[Deleted User] <[Deleted User]> #47
Thank you so much, it worked like a charm after I created the BigQuery table manually with a clustering key and then started a new stream.
[Deleted User] <[Deleted User]> #48
et...@google.com <et...@google.com> #49
This is now fixed. BigQuery has increased the limit to 16 primary key columns, and Datastream now aligns with this new limit.
BigQuery still doesn't support more than four clustering columns, so when replicating a table with more than four primary key columns, Datastream uses four primary key columns as the clustering columns.
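As an illustration (hypothetical project, dataset, and column names), a destination table with five primary key columns can only be clustered on four of them:
-- Assumed sketch only: BigQuery allows up to 16 primary key columns but at most
-- four clustering columns, so only four of the key columns are used for clustering.
CREATE TABLE `my-project.my_dataset.five_pk_table`
(
  k1 INT64,
  k2 INT64,
  k3 INT64,
  k4 INT64,
  k5 INT64,
  payload STRING,
  PRIMARY KEY (k1, k2, k3, k4, k5) NOT ENFORCED
)
CLUSTER BY k1, k2, k3, k4
OPTIONS(max_staleness = INTERVAL 15 MINUTE);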
Description
Some tables are copied, but not all. Looking at the logs, those that fail report a BIGQUERY_TOO_MANY_PRIMARY_KEYS error.
All Postgres tables have a primary key.