Status Update
Comments
sc...@joinhandshake.com <sc...@joinhandshake.com> #2
Ideally the Cloud SQL team would do the maintenance in a way that HA instances do not shut down, and stay up serving requests.
[Deleted User] <[Deleted User]> #3
ch...@gmail.com <ch...@gmail.com> #4
+1 to auto-failover before maintenance.
[Deleted User] <[Deleted User]> #5
ko...@sakartvelosoft.com <ko...@sakartvelosoft.com> #6
I am not a super-skilled rockstar of systems programming, but I will drop a few cents.
How I see true high availability:
1) A proxy sits on top of 2+ MySQL servers deployed in a managed autoscaling group (like at Google Compute): 1 master and 1+ slaves/read replicas.
2) If a request is read-only (only SELECT statements, without SELECT INTO and without procedure calls), dispatch it among the read replicas/slaves of the managed group.
3) If a request is not read-only, send it to the master.
4) If the master is planned to go under maintenance, promote a new master instance from a read replica/slave and set the old server to "maintenance" state.
5) When maintenance completes, sync the instance and join it back to the group as a read replica/slave.
* At (4), the replica with the shortest queue of pending requests is synced and promoted to the new master.
With this approach, all requests go through the group's umbrella proxy, and it decides where to send each request.
All transactional requests go to the current master, as they are considered non-read-only.
Master transition must be done in such a way:
0) Replica promotion starts (state changes to "becoming master").
1) All non-read-only requests are queued.
2) As soon as the master is promoted, all pending non-read-only requests are sent to it.
Request queueing must start only when syncing and promotion of the new master have started, not earlier.
Hope this makes sense.
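The routing and queue-during-promotion idea above can be sketched roughly as follows. This is a toy illustration, not a real proxy: the SQL classification is a crude heuristic, and all class/method names are invented for the sketch.

```python
import queue
import threading


class RoutingProxy:
    """Toy sketch of the read/write-splitting proxy proposed above.

    A production proxy (e.g. ProxySQL) would parse SQL properly; here we
    only illustrate routing, and queueing writes during promotion.
    """

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = list(replicas)
        self._rr = 0                        # round-robin cursor for replicas
        self.promoting = threading.Event()  # set while a new master is promoted
        self.write_queue = queue.Queue()    # non-read-only requests held at (1)

    @staticmethod
    def is_read_only(sql: str) -> bool:
        # Crude heuristic: SELECT without INTO and without procedure calls.
        s = sql.strip().lower()
        return s.startswith("select") and " into " not in s and "call " not in s

    def route(self, sql: str) -> str:
        if self.is_read_only(sql):
            # dispatch reads round-robin among the replicas
            target = self.replicas[self._rr % len(self.replicas)]
            self._rr += 1
            return target
        if self.promoting.is_set():
            # step 1 of the transition: queue writes until promotion completes
            self.write_queue.put(sql)
            return "queued"
        return self.master

    def begin_promotion(self, new_master: str):
        # step 0: state changes to "becoming master"; writes start queueing
        self.new_master = new_master
        self.promoting.set()

    def finish_promotion(self):
        # step 2: re-route the queued writes to the freshly promoted master
        self.master = self.new_master
        self.promoting.clear()
        while not self.write_queue.empty():
            self.route(self.write_queue.get())
```

The key property is that writes are never lost during the transition: they are held in the queue between `begin_promotion` and `finish_promotion`, matching the comment that queueing must start only once promotion has actually begun.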
The easiest way to make the existing HA feature MUCH more usable is to make sure that failover server maintenance happens 3-5 minutes after the primary instance's maintenance completes, similar to a rolling update of Google Compute instance groups.
[Deleted User] <[Deleted User]> #7
je...@twinkl.com.au <je...@twinkl.com.au> #8
an...@dapperlabs.com <an...@dapperlabs.com> #9
How this is acceptable for an "HA" deployment is beyond me, much less how this issue has 60 stars and has been open since July of 2018 without even being so much as assigned. Someone at least add an update that the GCP Postgres team is at a minimum aware of the issue and admits it's a problem.
al...@gmail.com <al...@gmail.com> #10
They promised 30-60 second fail-overs & maintenance and actually delivered on it so far.
hn...@gmail.com <hn...@gmail.com> #11
Everyone from our engineering team is genuinely perplexed when we tell them what HA in Cloud SQL actually means...
to...@gmail.com <to...@gmail.com> #12
ch...@gmail.com <ch...@gmail.com> #13
[Deleted User] <[Deleted User]> #14
I imagine what happens when Google decides to roll out an upgrade to Cloud SQL instances. Any SRE who pushes the button must think about how they're going to bring down production for thousands of highly available databases...
</sarcasm>
Would be great if we could get an update from the product team.
[Deleted User] <[Deleted User]> #15
1) Begin maintenance on the failover cluster.
2) Patch it completely.
3) Stand it up and sync it.
4) Promote it to master.
5) After the swap, perform maintenance on the now-defunct master.
6) Place it back as the failover slave after maintenance.
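The patch-the-standby-first workflow above could be orchestrated roughly like this. Everything here is hypothetical: `cluster` and its methods are invented for the sketch and are not a Cloud SQL API.

```python
def rolling_maintenance(cluster):
    """Hypothetical orchestration of the patch-failover-first workflow:
    patch the standby, promote it, then patch and rejoin the old primary."""
    standby = cluster.standby
    primary = cluster.primary

    cluster.stop_replication(standby)       # 1) begin maintenance on the failover
    cluster.patch(standby)                  # 2) patch it completely
    cluster.start(standby)                  # 3) stand it up...
    cluster.sync(standby, source=primary)   #    ...and sync it
    cluster.promote(standby)                # 4) promote it to master

    cluster.patch(primary)                  # 5) patch the now-defunct master
    cluster.start(primary)
    cluster.attach_as_standby(primary, of=standby)  # 6) rejoin as failover
```

The write-unavailability window in this scheme is only the promote step plus the final sync delta, rather than the full patch duration.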
ke...@gmail.com <ke...@gmail.com> #16
ke...@gmail.com <ke...@gmail.com> #17
ot...@gmail.com <ot...@gmail.com> #18
mi...@gmail.com <mi...@gmail.com> #19
jo...@everflow.io <jo...@everflow.io> #20
[Deleted User] <[Deleted User]> #21
ni...@sada.com <ni...@sada.com> #22
[Deleted User] <[Deleted User]> #23
[Deleted User] <[Deleted User]> #24
cl...@d-teknoloji.com.tr <cl...@d-teknoloji.com.tr> #25
az...@gmail.com <az...@gmail.com> #26
[Deleted User] <[Deleted User]> #27
[Deleted User] <[Deleted User]> #28
or...@gmail.com <or...@gmail.com> #29
la...@gmail.com <la...@gmail.com> #30
hn...@gmail.com <hn...@gmail.com> #31
Btw. P0 priority is (according to
> An issue that needs to be addressed immediately and with as many resources as is required. Such an issue causes a full outage or makes a critical function of the product to be unavailable for everyone, without any known workaround.
ch...@gmail.com <ch...@gmail.com> #32
ox...@gmail.com <ox...@gmail.com> #33
Definitely pushing people to other cloud providers, isn't it?
ko...@bilt.com <ko...@bilt.com> #34
an...@dapperlabs.com <an...@dapperlabs.com> #35
No official word from Google so I'm inclined to weigh in here with the little I do know and suspect. (I obviously don't speak for Google, so grain of salt and all that.)
The focus of Cloud SQL is minimizing the duration and frequency of downtime windows, not completely eliminating the possibility of instances ever going down. For that, there is Cloud Spanner, which is arguably the flagship product of GCP and what I suspect Google uses internally for everything. There's little incentive for them to pour time and resources into making Postgres never-down when that is the bread and butter of Spanner, especially when no one is using Postgres internally.
GCP is keenly aware of this issue, and I've raised it through as many channels as I could. The occasional update to the metadata here suggests it does get looked at, and I don't envy the product manager who would have to explain to the mob why this never gets priority, so maybe it's just best to let this issue rot. Keep adding to the 123 stars, but 30+ comments of "+1" and "we're leaving for X" is probably enough.
For what it's worth, we track outages caused by these database rolls, and while they do still happen, year over year they are going nothing but down. Breaking updates that take our HA cluster offline seem to happen less often, though they take about the same amount of time to recover from as they always have. We shifted a few downtime-sensitive DBs elsewhere and, as our data got global enough to merit it, have started exploring Spanner and its associated price tag more seriously.
As much as I'd love for this to be solved, we've accepted it where we can, and moved on where we couldn't.
be...@backmarket.com <be...@backmarket.com> #36
Has anyone experienced that improvement?
Maintenance still does not involve a failover to standby instances, from what we can see in the docs (
sa...@gmail.com <sa...@gmail.com> #37
be...@banked.com <be...@banked.com> #38
Three years of silence from Google, jokeshop.
aj...@google.com <aj...@google.com> #39
Hello Cloud SQL users,
This year, Cloud SQL redesigned our maintenance workflow to reduce maintenance downtime by 80%. Our new workflow utilizes a shared disk failover approach, in which we first update a failover target VM with the maintenance patch while the original is still running. Once the update is complete, we stop the original instance, switch over the disk, failover the IP address, and resume traffic on the updated VM. Typical maintenance downtime per engine is below:
- PostgreSQL - 30 seconds or less
- MySQL - 60 seconds or less
- SQL Server - 120 seconds or less
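The shared-disk failover sequence described above can be sketched as the following ordered steps. This is an illustration only; the objects and methods are invented stand-ins for Google's internal tooling, not a Cloud SQL API.

```python
def shared_disk_maintenance(primary, patch):
    """Toy sketch of the shared-disk failover maintenance workflow:
    patch a standby VM first, then swap disk and IP over to it."""
    target = primary.provision_failover_vm()
    target.apply(patch)               # patch while the original still serves traffic
    primary.stop()                    # downtime window begins here
    primary.disk.attach_to(target)    # switch over the shared disk
    primary.ip.move_to(target)        # fail over the IP address
    target.resume_traffic()           # downtime window ends
    return target                     # the patched VM is now the instance
```

Only the last four steps sit inside the downtime window, which is why this approach can keep it to tens of seconds rather than the length of the whole patch.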
For more information, you can read our blog on the new maintenance process:
Since the shared disk failover approach has been implemented, we plan to mark this issue tracker ticket as "Fixed" in one week's time. When combining faster maintenance with scheduling controls like maintenance windows and deny maintenance periods, many Cloud SQL customers are no longer impacted by maintenance.
However, for 24/7 business-critical applications that have very high uptime requirements, maintenance may still be disruptive. For these customers, we continue to invest in reducing maintenance downtime. To indicate that your application requires further improvement in maintenance downtime, please star this new issue tracker ticket so we can track interest.
All the best,
Akhil
Cloud SQL Product Manager
Description
This request is for a feature that allows an automatic trigger of the failover on an instance that has the high availability configuration enabled.
[1]