Status Update
Comments
va...@google.com <va...@google.com>
ha...@ingka.ikea.com <ha...@ingka.ikea.com> #2 Restricted+
ds...@google.com <ds...@google.com> #3
Hello,
Could you please clarify the issue by providing the following information:
- A detailed issue description
- A screenshot of the issue you are facing
- Steps to reproduce
If possible, please also provide a summary of the steps you have followed or workarounds you have already tried, and link to any documentation or articles you found (un)helpful. We need this information to investigate the issue.
vi...@ludotech.co <vi...@ludotech.co> #4
jo...@sinedigital.co.uk <jo...@sinedigital.co.uk> #5
ga...@uscreen.tv <ga...@uscreen.tv> #6
I can share how I'm able to reproduce it:
- Create an empty table:
CREATE TABLE my_dataset.test_table (
  field1 STRING,
  field2 STRING
);
- Fill the table with test data. I prepared this Go script:
package main

import (
    "context"
    "encoding/base64"
    "fmt"
    "os"
    "sync"

    "cloud.google.com/go/bigquery"
    "go.uber.org/zap"
    "go.uber.org/zap/zapcore"
    "golang.org/x/oauth2/google"
    "golang.org/x/sync/semaphore"
    "google.golang.org/api/option"
)

var (
    log       *zap.SugaredLogger
    bqStorage *BigQueryStorage
)

// BigQueryRow is the payload streamed into the test table.
type BigQueryRow struct {
    Field1 string
    Field2 string
}

type BigQueryStorage struct {
    BQClient *bigquery.Client
    Log      *zap.SugaredLogger
    sem      *semaphore.Weighted
}

func NewBigQueryStorage(ctx context.Context, log *zap.SugaredLogger) (*BigQueryStorage, error) {
    creds, err := base64.StdEncoding.DecodeString(os.Getenv("BQ_CREDS"))
    if err != nil {
        log.Fatalf("[Big Query]: Could not decode credentials: %v", err)
    }
    if creds != nil {
        log.Infof("[Big Query]: credentials provided")
    }
    config, err := google.JWTConfigFromJSON(creds, bigquery.Scope)
    if err != nil {
        log.Fatalf("[Big Query]: Could not create config from JSON: %v", err)
    }
    log.Infof("[Big Query]: Creating client %v", config.Scopes)
    bqClient, err := bigquery.NewClient(ctx, "project-name", option.WithTokenSource(config.TokenSource(ctx)))
    if err != nil {
        return nil, err
    }
    log.Info("[Big Query]: Client created")
    return &BigQueryStorage{BQClient: bqClient, Log: log, sem: semaphore.NewWeighted(1024)}, nil
}

// BatchInsertRowsConcurrently streams the rows in batches of 500 via the
// insertAll (streaming) API, so they land in the table's streaming buffer.
func (s *BigQueryStorage) BatchInsertRowsConcurrently(ctx context.Context, bqRows []BigQueryRow) error {
    const batchSize = 500
    var wg sync.WaitGroup
    errorsCh := make(chan error, len(bqRows)/batchSize+1)
    for i := 0; i < len(bqRows); i += batchSize {
        end := i + batchSize
        if end > len(bqRows) {
            end = len(bqRows)
        }
        wg.Add(1)
        go func(start, end int) {
            defer wg.Done()
            sessionsTempTable := "test_table"
            ins := s.BQClient.Dataset("my_dataset").Table(sessionsTempTable).Inserter()
            s.Log.Infof("[BQtest]: Inserting rows from %d to %d; total: %d", start, end, len(bqRows[start:end]))
            if err := ins.Put(ctx, bqRows[start:end]); err != nil {
                s.Log.Errorf("[BQtest]: failed to insert into temp table: %v", err)
                errorsCh <- err
            }
        }(i, end)
    }
    wg.Wait()
    close(errorsCh)
    for err := range errorsCh {
        if err != nil {
            return err
        }
    }
    return nil
}

func main() {
    ctx := context.Background()
    initLogger()
    st, err := NewBigQueryStorage(ctx, log)
    if err != nil {
        log.Fatalf("[BQtest]: Could not create BigQuery client: %v", err)
    }
    bqStorage = st

    var bqRows []BigQueryRow
    for i := 0; i < 10000; i++ {
        bqRows = append(bqRows, BigQueryRow{
            Field1: fmt.Sprintf("value%d", i),
            Field2: fmt.Sprintf("value%d", i),
        })
    }
    if err := bqStorage.BatchInsertRowsConcurrently(ctx, bqRows); err != nil {
        bqStorage.Log.Errorf("[BQtest]: Failed to insert rows: %v", err)
    }

    // Truncate the table right after streaming, while rows are still in the streaming buffer.
    log.Infof("[BQtest]: Rows inserted. Start truncating temp table")
    truncateSQL := fmt.Sprintf(`TRUNCATE TABLE %s;`, "my_dataset.test_table")
    truncateQuery := bqStorage.BQClient.Query(truncateSQL)
    job, err := truncateQuery.Run(ctx)
    if err != nil {
        bqStorage.Log.Errorf("[BQtest]: failed to truncate temp table: %w", err)
        return
    }
    status, err := job.Wait(ctx)
    if err != nil {
        bqStorage.Log.Errorf("[BQtest]: failed to wait for truncate job: %w", err)
        return
    }
    if status.Err() != nil {
        bqStorage.Log.Errorf("[BQtest]: failed to execute the truncate job: %w", status.Err())
        return
    }
}

func initLogger() {
    highPriority := zap.LevelEnablerFunc(func(lvl zapcore.Level) bool {
        return lvl >= zapcore.ErrorLevel
    })
    lowPriority := zap.LevelEnablerFunc(func(lvl zapcore.Level) bool {
        return lvl < zapcore.ErrorLevel
    })
    consoleDebugging := zapcore.Lock(os.Stdout)
    consoleErrors := zapcore.Lock(os.Stderr)
    consoleEncoder := zapcore.NewConsoleEncoder(zap.NewDevelopmentEncoderConfig())
    core := zapcore.NewTee(
        zapcore.NewCore(consoleEncoder, consoleErrors, highPriority),
        zapcore.NewCore(consoleEncoder, consoleDebugging, lowPriority),
    )
    log = zap.New(core).Sugar()
}
I receive this error every time I run this script. Before 2024-02-23 it worked perfectly fine. This is the error:
2024-02-27T17:57:57.882+0400 ERROR [BQtest]: failed to wait for truncate job: %!w(*googleapi.Error=&{400 TRUNCATE DML statement over table video_analytics.test_table would affect rows in the streaming buffer, which is not supported [] {
"error": {
"code": 400,
"message": "TRUNCATE DML statement over table my_dataset.test_table would affect rows in the streaming buffer, which is not supported",
"errors": [
{
"message": "TRUNCATE DML statement over table my_dataset.test_table would affect rows in the streaming buffer, which is not supported",
"domain": "global",
"reason": "invalidQuery",
"location": "q",
"locationType": "parameter"
}
],
"status": "INVALID_ARGUMENT"
}
}
ma...@gutsy.com <ma...@gutsy.com> #7
I already added the information below in the issue description, but I am adding it again.
Detailed issue description
We get an error when trying to run a TRUNCATE query on a table with rows in the streaming buffer. Until 23.2.2024 it worked without any errors; since then the operation fails.
Screenshot of the issue
The error is:
googleapi: Error 400: TRUNCATE DML statement over table XXX would affect rows
in the streaming buffer, which is not supported, invalidQuery
Steps to reproduce
- Stream rows into a table
- Try to run TRUNCATE TABLE XXX
Additional information
From the current docs, the TRUNCATE statement appears to be classified as a DML statement, as can be seen here:
But from a previous reference:
ds...@google.com <ds...@google.com> #8
Hi, TRUNCATE DML on tables with data in the streaming buffer is not supported; this restriction was only recently fully rolled out. You can star this issue to track it as a feature request, or raise a separate issue requesting the feature. However, we cannot provide an estimated time of implementation or guarantee that the feature request will be fulfilled. Please be assured that your input is highly valued; your feedback enables us to enhance our products and services.
ma...@gutsy.com <ma...@gutsy.com> #9
Why isn't such a change documented? Can you point me to documentation about it?
As you can see, this behaviour was something users relied on, and it changed suddenly and without notice. Should we expect other statements to change in such a way? Because that is not reliable.
vi...@ludotech.co <vi...@ludotech.co> #10
ma...@brevo.com <ma...@brevo.com> #11
ds...@google.com <ds...@google.com> #12
Hi, as informed by the Engineering team:
Before the fix, the TRUNCATE TABLE DML statement could modify tables with data in the streaming buffer, and Row-level Access Policies on the table were removed.
=> With the fix, TRUNCATE TABLE DML can no longer modify tables that have data in the streaming buffer. Customers will get an error message: "TRUNCATE DML statement over table would affect rows in the streaming buffer, which is not supported." Row-level Access Policies on the table will also not be removed.
I hope the above information answers your questions.
jo...@sinedigital.co.uk <jo...@sinedigital.co.uk> #13
The bigger problem here, which has made TRUNCATE so useful up to now, is the indeterminate nature of the streaming buffer itself. We have no way to know when or whether it has flushed. If we had a way to poll its contents or force a flush, the changes to TRUNCATE would be much less of an issue.
Whilst we also appreciate that BigQuery is a DWH and not an RDBMS, modern pipelines do need to use DML. BigQuery, with a streaming-buffer flushing policy of 'up to 90 minutes', is too indeterminate to make this possible.
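For what it is worth, the table metadata does expose a best-effort estimate of the streaming buffer's state (estimated rows and bytes, and the age of the oldest entry). Below is a minimal sketch of polling it with the Go client, assuming the same project and my_dataset.test_table from the repro in #6; since these figures are only estimates and the flush time is undetermined, this tells you whether buffered rows are currently reported, not when a TRUNCATE would be safe.
// Sketch: poll the table metadata for the reported streaming-buffer state.
// This is an estimate only; an empty report does not guarantee the buffer has flushed.
package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "cloud.google.com/go/bigquery"
)

func main() {
    ctx := context.Background()
    client, err := bigquery.NewClient(ctx, "project-name") // same project as the repro above
    if err != nil {
        log.Fatalf("NewClient: %v", err)
    }
    defer client.Close()

    table := client.Dataset("my_dataset").Table("test_table")
    for {
        md, err := table.Metadata(ctx)
        if err != nil {
            log.Fatalf("Metadata: %v", err)
        }
        if md.StreamingBuffer == nil {
            fmt.Println("no streaming buffer reported for this table")
            return
        }
        fmt.Printf("streaming buffer: ~%d rows, ~%d bytes, oldest entry %s\n",
            md.StreamingBuffer.EstimatedRows,
            md.StreamingBuffer.EstimatedBytes,
            md.StreamingBuffer.OldestEntryTime)
        time.Sleep(30 * time.Second)
    }
}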
ds...@google.com <ds...@google.com> #14
Hello,
This issue report has been forwarded to the BigQuery Engineering team so that they may investigate it, but there is currently no ETA for a resolution. Future updates regarding this issue will be provided here.
sa...@google.com <sa...@google.com> #15
Hey team!
Wanted to flag that another customer of the Digital Assets team - Cymbal - is also facing this issue.
Please keep us posted on the ETA for resolution.
sa...@google.com <sa...@google.com> #16
ha...@ingka.ikea.com <ha...@ingka.ikea.com> #17 Restricted+
om...@starsandstories.com <om...@starsandstories.com> #18
mv...@google.com <mv...@google.com> #19
Thank you everyone for your feedback. We take your input seriously and use it to continuously improve BigQuery and maintain its position as a trusted and reliable data warehouse.
On February 22nd, we implemented a change to align the behavior of the TRUNCATE TABLE DML statement for streamed data with other DML statements such as UPDATE, DELETE, and MERGE, while also introducing new DML feature capabilities.
While we initially created an allowlist to maintain the previous behavior for existing users, we understand that some customers were mistakenly not included in this allowlist. Thanks to your valuable feedback, we recognized the disruption this change caused and have added all impacted customers to the allowlist. This enables those who previously used the TRUNCATE TABLE DML statement with data in the streaming buffer to resume using it as they did previously.
We appreciate your patience and understanding. Please do not hesitate to open a support case if you require additional information or assistance.
BigQuery team
jo...@sinedigital.co.uk <jo...@sinedigital.co.uk> #20
As there is no practical workaround (drop/create is not a long-term solution), what is the engineering team planning in order to permanently resolve this problem?
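For reference, the drop/create workaround mentioned above can be scripted with the Go client roughly as follows; this is only a sketch, assuming the two-column STRING schema from the repro script, and deleting a table also discards its other metadata (description, access settings, expiration), which is part of why it is not a long-term substitute for TRUNCATE.
// Sketch: replace TRUNCATE with drop-and-recreate using the Go client.
// Assumes the two-column STRING schema from the repro script above.
package main

import (
    "context"
    "log"

    "cloud.google.com/go/bigquery"
)

func main() {
    ctx := context.Background()
    client, err := bigquery.NewClient(ctx, "project-name")
    if err != nil {
        log.Fatalf("NewClient: %v", err)
    }
    defer client.Close()

    table := client.Dataset("my_dataset").Table("test_table")

    // Drop the table; whatever is sitting in the streaming buffer goes with it.
    if err := table.Delete(ctx); err != nil {
        log.Fatalf("Delete: %v", err)
    }

    // Recreate it with the original schema.
    schema := bigquery.Schema{
        {Name: "field1", Type: bigquery.StringFieldType},
        {Name: "field2", Type: bigquery.StringFieldType},
    }
    if err := table.Create(ctx, &bigquery.TableMetadata{Schema: schema}); err != nil {
        log.Fatalf("Create: %v", err)
    }
}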
jo...@sinedigital.co.uk <jo...@sinedigital.co.uk> #21
Currently, we do not know how reliable use of the TRUNCATE statement is, as there is no detail above about who retains it in its previous form and how long that will continue to work.
Please can you address the above points, so that we know whether this statement can be used again or whether we should ignore it going forward. We use a significant number of streaming inserts, so there will always be a risk of rows in the streaming buffer, and these will always need to be deleted along with any rows in the table.
di...@gmail.com <di...@gmail.com> #22
mo...@gmail.com <mo...@gmail.com> #23
My situation:
1) A Cloud Function inserts data into BigQuery. Sometimes the Cloud Function hits an OutOfMemory error and crashes while the streaming buffer still contains some rows.
2) When I retrigger the Cloud Function, I have to clear the BigQuery table before inserting again.
3) I run the TRUNCATE immediately after a BigQuery insert failure, but this now produces the above error, forcing me to wait 45 minutes every time there is an insert failure (a sketch for detecting this specific error follows below).
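If retries like this are unavoidable, the specific failure can at least be detected programmatically rather than being treated as a generic query error. Here is a rough sketch based on the googleapi.Error payload shown in #6 (code 400, reason invalidQuery, message naming the streaming buffer), so the caller can back off or fall back to another strategy; isStreamingBufferError is a hypothetical helper name, not part of any library.
// Sketch: classify the "rows in the streaming buffer" failure so a caller can
// back off or fall back instead of failing outright. isStreamingBufferError is
// a hypothetical helper, matching the 400/invalidQuery payload shown in #6.
package main

import (
    "errors"
    "fmt"
    "strings"

    "google.golang.org/api/googleapi"
)

func isStreamingBufferError(err error) bool {
    var gerr *googleapi.Error
    if !errors.As(err, &gerr) {
        return false
    }
    return gerr.Code == 400 && strings.Contains(gerr.Message, "streaming buffer")
}

func main() {
    // Synthetic error with the shape reported in #6, to exercise the check.
    err := &googleapi.Error{
        Code:    400,
        Message: "TRUNCATE DML statement over table my_dataset.test_table would affect rows in the streaming buffer, which is not supported",
    }
    fmt.Println(isStreamingBufferError(err)) // true
}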
Description
Problem you have encountered:
On 23.2.2024 we started getting errors when trying to TRUNCATE a table with rows in the streaming buffer.
We have daily tests, running for a few months now, that create a table, stream rows into it, and TRUNCATE it. They started failing on 23.2.2024 and had never failed before.
What you expected to happen:
Expected TRUNCATE to not fail due to rows in the streaming buffer.
Steps to reproduce:
Have rows in the streaming buffer of a table and try to truncate it.
Other information (workarounds you have tried, documentation consulted, etc):
In the current documentation it seems that TRUNCATE is considered a DML statement, but previous docs stated that it is a DDL statement. For example, here: https://cloud.google.com/blog/topics/developers-practitioners/bigquery-explained-data-manipulation-dml
This sort of change is a breaking change for us, and we would like to understand whether it is by design or a bug.