Status Update
Comments
sa...@doit.com <sa...@doit.com> #2
Getting the same request from customers almost every day.
ha...@digits.com <ha...@digits.com> #3
ad...@filio.io <ad...@filio.io> #4
You guys created something awesome that no one will use because it just does not make sense from a pricing standpoint. What if I have 10 customers and I am going to create 10 different endpoints for them (let's say I don't want to split the traffic); does that mean I have to pay $9,000 a month? :)
fa...@axa.ch <fa...@axa.ch> #5
ma...@saschaheyer.de <ma...@saschaheyer.de> #6
[Deleted User] <[Deleted User]> #7
[Deleted User] <[Deleted User]> #8
Would greatly improve the product.
ak...@scapes.studio <ak...@scapes.studio> #9
ty...@gmail.com <ty...@gmail.com> #10
he...@gmail.com <he...@gmail.com> #11
(At least if you don't need a GPU or a huge amount of memory)
Though you're losing features like Explainable AI or Model Monitoring this way.
he...@gmail.com <he...@gmail.com> #12
[Deleted User] <[Deleted User]> #13
sa...@doit.com <sa...@doit.com> #14
bs...@gmail.com <bs...@gmail.com> #15
[Deleted User] <[Deleted User]> #16
wo...@gmail.com <wo...@gmail.com> #17
sw...@affable.ai <sw...@affable.ai> #18
ef...@gmail.com <ef...@gmail.com> #19
[Deleted User] <[Deleted User]> #20
This should be done as a priority.
ae...@google.com <ae...@google.com> #21
da...@google.com <da...@google.com> #22
dr...@gmail.com <dr...@gmail.com> #23
si...@gmail.com <si...@gmail.com> #24
My app requires just a few predictions a day; I will have to leave Vertex AI because of that.
And I lost all of my trial credits in a few days without noticing.
+10
ch...@evolutioniq.com <ch...@evolutioniq.com> #25
sa...@doit.com <sa...@doit.com> #26
No waste of energy.
mc...@google.com <mc...@google.com> #27
Thanks everyone for all the feedback. For those tracking this bug, can you please update it with the following info? Feel free to also send me an email at
- (1) Are CPUs sufficient or are GPUs required?
- (2) What is your cold start latency tolerance (i.e. once you're at zero and have a new request how long are you willing to wait for a response?)
- (3) How much of your current pain point can be solved with co-hosting multiple models on the same endpoint (once this is rolled out for all frameworks; see https://cloud.google.com/blog/products/ai-machine-learning/introducing-co-hosting-models-on-the-vertex-ai-prediction-service)? If not, please explain why co-hosting models is not sufficient.
- (4) What alternative solution have you leveraged (i.e. Cloud Run for CPUs only, DIY on GCE, another PaaS/SaaS solution)?
Thanks in advance. This feedback is invaluable and thank you for evaluating Vertex AI!
sa...@doit.com <sa...@doit.com> #28
Thanks for asking for our input, highly appreciated.
(1) CPUs would already be a great starting point. With the ever-increasing usage of large models like transformers, there is certainly also a demand for GPUs. So both :)
(2) The cold start latency also depends on the model size as it has to be loaded into memory. Just one of many factors in addition to the infrastructure needed. I would take the same tolerance we currently have with Cloud Run as a benchmark.
(3) Co-hosting models, I can't tell you for how long I have looked forward to that feature, great success. Though it doesn't solve the issue of scaling down to zero if there are no requests. Imagine a customer operating in just one region, like the EU, who might not receive a lot of traffic during the night. With co-hosting, the models might use the resources more efficiently, but the endpoint is still up and running and produces costs.
(4) Cloud Run (with the downside of not being able to use the additional features Vertex AI provides, like Model Monitoring or Explainable AI). And obviously no support for GPUs.
st...@gmail.com <st...@gmail.com> #29
I was actually just looking into using Vertex AI but require scale-to-zero to keep costs low. We are probably going to use Cloud Run for the moment. To answer your questions:
1. We would really like GPU support; if CPUs are the only option, we will simply use Cloud Run instead.
2. As the above poster said, we would probably be benchmarking against Cloud Run.
3. Being honest, I am unsure how this would solve our issue; however, I am new to this game, so I might be missing something.
4. Just starting to look at Cloud Run
Thanks for having a look at this.
Cheers,
Stephen
[Deleted User] <[Deleted User]> #30
(1) GPUs would be better, but CPU would be fine if this means a significantly lower cold start latency
(2) A cold start of 10-15 seconds would be the benchmark to beat our current solution. Below 10 seconds would be optimal.
(3) Co-hosting would not solve anything since our clients require us to host their model in a separate GCP project of their own, and most of the time we only have one or two models per client. Sharing an endpoint would therefore not be possible.
(4) Currently we utilize a Cloud Function that grabs the model from a storage bucket on a cold start (a minimal sketch of this pattern is below). We are currently exploring whether other providers have a sufficient alternative, should this remain an issue.
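Roughly, that workaround looks like the sketch below (simplified; the bucket/blob names and the joblib/scikit-learn format are placeholders, not our exact implementation):

```python
# Cloud Function (HTTP) that lazily loads a model from Cloud Storage on cold
# start and caches it in a module-level variable for warm invocations.
import functions_framework
import joblib
from google.cloud import storage

_model = None  # reused across warm invocations of the same instance


def _load_model():
    global _model
    if _model is None:
        # Placeholder bucket/object names; /tmp is the writable path in Cloud Functions.
        blob = storage.Client().bucket("my-model-bucket").blob("models/model.joblib")
        blob.download_to_filename("/tmp/model.joblib")
        _model = joblib.load("/tmp/model.joblib")
    return _model


@functions_framework.http
def predict(request):
    payload = request.get_json(silent=True) or {}
    model = _load_model()
    # Assumes a scikit-learn-style estimator returning a NumPy array.
    predictions = model.predict(payload.get("instances", [])).tolist()
    return {"predictions": predictions}
```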
Thanks for listening to us.
Greetings,
Björn
mc...@google.com <mc...@google.com> #31
Thanks Sascha, Stephen and Bjoern! Super helpful feedback.
@Others: Please keep adding comments so we can better understand needs and assess feasible options.
ad...@filio.io <ad...@filio.io> #32
(1) Are CPUs sufficient or are GPUs required? Does not matter; for small-sized companies, cost typically matters more than speed, so having multiple options is important.
(2) What is your cold start latency tolerance (i.e. once you're at zero and have a new request how long are you willing to wait for a response?): Something like the GAE approach should be good.
(3) How much of your current pain point can be solved with co-hosting multiple models on the same endpoint (once this is rolled out for all frameworks)
(4) What alternative solution have you leveraged (i.e. Cloud Run for CPUs only, DIY on GCE, another PaaS/SaaS solution)? Was trying to use Cloud Run, but it is not sustainable at all; we were able to automate the whole process of data collection, integration, training and prediction, but with Cloud Run you still need to do quite a bit of manual work until you have your endpoint up and running.
[Deleted User] <[Deleted User]> #33
Thanks for soliciting this feedback. Here are my answers:
(1) Are CPUs sufficient or are GPUs required? CPUs would be sufficient for our current use case. Inference for many scikit-learn & xgboost models is very fast on CPU, including the model we are currently trying to deploy (which is a scikit-learn wrapper around something more custom). Ideally GPUs would also eventually be available (for anything requiring deep learning models), but I would hope GPU-support would not block roll-out of a CPU-only scale-to-zero feature as soon as that's available, as that would still be widely valuable I'm sure.
(2) What is your cold start latency tolerance (i.e. once you're at zero and have a new request how long are you willing to wait for a response?) Under 0.5 seconds would be ideal, 0.5-1.0 seconds still fine, 1 - 2.5 seconds would be tolerable but less ideal.
(3) How much of your current pain point can be solved with co-hosting multiple models on the same endpoint (once this is rolled out for all frameworks)
(4) What alternative solution have you leveraged (i.e. Cloud Run for CPUs only, DIY on GCE, another PaaS/SaaS solution)? We have not tried any alternatives yet; Vertex AI is the first thing we planned to try, per plenty of advice from GCP documentation and blogs that this sort of thing is what Vertex AI is for. I also know that I read over a year ago that GCP supports scale-to-zero for serverless model inference/deployment; from the opening post in this feature-request thread, I see that must have been for GCP's previous 'AI Platform', and it is no longer supported in the new Vertex AI. From my perspective, this is an enormous regression.
r0...@email.wal-mart.com <r0...@email.wal-mart.com> #34
mc...@google.com <mc...@google.com> #35
Hey Rahul,
This is currently referring to online serving for models in Vertex Predictions (
mo...@objectcomputing.com <mo...@objectcomputing.com> #36
(1) Are CPUs sufficient or are GPUs required?
CPUs only would be fine to give me a reason to use Vertex for model serving over my own Cloud Run implementation. GPUs would be nice to have, but if you're using GPUs for online prediction you're going to be eating sizeable costs anyway, so it's less important for the scale-to-0 option IMO.
(2) What is your cold start latency tolerance (i.e. once you're at zero and have a new request how long are you willing to wait for a response?)
Considering this is more of a budget option, 10-15s is probably reasonable
(3) How much of your current pain point can be solved with co-hosting multiple models on the same endpoint (once this is rolled out for all frameworks)
It might help but doesn't completely solve the problem of paying for at least 1 node all month long.
(4) What alternative solution have you leveraged (i.e. Cloud Run for CPUs only, DIY on GCE, another PaaS/SaaS solution)?
Cloud Run works great but doesn't have the nice Vertex features like multiple model versions on an endpoint and such.
ur...@khealth.com <ur...@khealth.com> #37
2. less than 5 seconds, but it'll still give us value even if it's more
3. co-hosting can be a great solution for our pain (once this is rolled out for all frameworks)
4. We haven't implemented anything yet; one approach we're considering is a scheduled task that undeploys and redeploys models from endpoints outside of working hours for our lower-environment endpoints (a minimal sketch of this pattern is below).
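A rough sketch of what that scheduled task could look like with the Vertex AI Python SDK (project, region, and resource IDs are placeholders; the two functions would be triggered by e.g. Cloud Scheduler in the evening and in the morning):

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

ENDPOINT_ID = "1234567890"  # placeholder endpoint ID
MODEL_ID = "0987654321"     # placeholder model ID


def scale_down():
    """Undeploy all models outside working hours so no nodes are billed."""
    aiplatform.Endpoint(ENDPOINT_ID).undeploy_all()


def scale_up():
    """Redeploy the model before working hours start."""
    endpoint = aiplatform.Endpoint(ENDPOINT_ID)
    model = aiplatform.Model(MODEL_ID)
    model.deploy(
        endpoint=endpoint,
        machine_type="n1-standard-2",  # placeholder machine type
        min_replica_count=1,  # 1 is the current minimum, hence this workaround
        max_replica_count=1,
    )
```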
mo...@loreal.com <mo...@loreal.com> #38
- scaling to 0, with a reasonable cold start time
- GPUs on endpoints
em...@jaguarlandrover.com <em...@jaguarlandrover.com> #39
- scaling to 0, with a reasonable cold start time
- we are also considering a scheduled task that undeploys and redeploys models from endpoints outside of working hours
jc...@jaguarlandrover.com <jc...@jaguarlandrover.com> #40
>> Scale to zero for cost-saving measures
>> Cold start that's suitable for a production environment
sa...@google.com <sa...@google.com> #41
- CPUs on endpoints
tj...@pinnacol.com <tj...@pinnacol.com> #42
- scaling to 0, with a reasonable cold start time
mc...@google.com <mc...@google.com> #43
Update: We are actively designing and prototyping this; however, we don't have specific timelines to share. We will let folks here know when we have something ready to test.
fe...@gmail.com <fe...@gmail.com> #44
me...@google.com <me...@google.com> #45
+1 for customer KAUST in Saudi Arabia.
Customer would like to deploy AI model inference (with GPU backend) that scales to zero.
[Deleted User] <[Deleted User]> #46
Our usage is likely to be batched and there will likely be long periods of downtime between spikes of usage.
In our case specifically a slow cold start would probably be fine as well, as we're most likely to batch these predictions up, but I suspect that won't be true for most.
Being able to keep these at zero most of the time would be highly appreciated (and better for the planet ;) )
ya...@google.com <ya...@google.com> #47
- scaling to 0, with a reasonable cold start time
bo...@gmail.com <bo...@gmail.com> #48
- CPUs are sufficient
- Cold start latency tolerance: <5 seconds would be fine.
- Co-hosting multiple models: not a solution because we often have only one model per project.
- Alternative solution: as long as scaling to zero is not implemented in Vertex AI we use the old AI Platform.
mi...@gmail.com <mi...@gmail.com> #49
It makes things extremely expensive for a new business because of the passive expense of having to allocate a machine all of the time, even when no one is using it.
ad...@filio.io <ad...@filio.io> #50
@mc...@google.com Any update on this feature?
pv...@aiuta.com <pv...@aiuta.com> #51
I'm OK with Cloud Run and just want it with GPU/TPU support.
Sorry for going off-topic, but I am not happy with the inability to delete failed Vertex endpoints, with the endpoint deployment time once I have my custom image ready, or with the current protocol (PredictResponse) being designed only for classification/regression tasks, not for generative models that return images or other media content.
cr...@google.com <cr...@google.com> #52
zd...@gmail.com <zd...@gmail.com> #53
Supporting this idea. Our use case: we're using Vertex AI and AutoML to train video, image, and tabular models. When a video or image is uploaded (it isn't frequent, let's say 10-20 times per day), a prediction is done. We have around 15 video/image models, and having online prediction endpoints for all of them would be very uneconomical. So at the moment for videos we're using batch prediction (with one item) and it takes ~3 minutes. For some reason, batch prediction for images takes ~20-25 minutes even if there is only one image, so to speed things up a bit for images we're creating an endpoint, deploying a model, doing online prediction, undeploying the model, and deleting the endpoint. It's a bit cumbersome but it works, although the process takes around 10-15 minutes since model deployment takes a long time. For tabular models, we're exporting those and deploying them on Cloud Run.
Anyway, it would be simpler if Endpoints worked like Cloud Run, or if it were straightforward to deploy models on Cloud Run. Supporting GPUs would be a nice feature; cold starts are not so critical.
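For reference, a simplified sketch of that create/deploy/predict/tear-down cycle with the Vertex AI Python SDK (IDs and the prediction payload are placeholders, and the exact resource arguments depend on the model type):

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

model = aiplatform.Model("1234567890")  # placeholder AutoML image model ID
endpoint = aiplatform.Endpoint.create(display_name="temporary-endpoint")
try:
    # This deployment step is what takes most of the 10-15 minutes.
    model.deploy(endpoint=endpoint, min_replica_count=1, max_replica_count=1)
    response = endpoint.predict(
        instances=[{"content": "<base64-encoded image>"}]  # placeholder payload
    )
    print(response.predictions)
finally:
    endpoint.undeploy_all()
    endpoint.delete()
```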
ca...@trademe.co.nz <ca...@trademe.co.nz> #54
We wish to use Vertex for model deployment, and expect constantly running nodes in production, but we would like our staging and test environments to scale to zero as those environments have no real customers to serve.
pa...@yosh.ai <pa...@yosh.ai> #55
With CPU + GPU support and scalability to 0.
Alternative would be Cloud Run with GPU support.
In both cases low cold start would be beneficial.
ku...@google.com <ku...@google.com>
va...@google.com <va...@google.com>
ra...@hellosivi.com <ra...@hellosivi.com> #56
br...@gmail.com <br...@gmail.com> #57
va...@google.com <va...@google.com>
na...@gmail.com <na...@gmail.com> #58
To answer the questions
(1) GPUs absolutely required
(2) Latency tolerance - something reasonable, ~ up to 60 secs
(3) Don't see how co-hosting can solve this issue, I need to be able to scale down to 0 nodes when no traffic
(4) Looking into DIY solutions right now, but no idea if I will be able to find an alternative
ga...@gmail.com <ga...@gmail.com> #59
le...@psicokit.com <le...@psicokit.com> #60
fa...@axa.ch <fa...@axa.ch> #61
Thanks
am...@google.com <am...@google.com> #62
- scaling to 0, with a reasonable cold start time
su...@google.com <su...@google.com>
sa...@doit.com <sa...@doit.com> #63
Do you think our feedback was helpful?
Is there anything on the roadmap?
(I believe we are all aware of the complexity around GPUs, especially ensuring availability)
jo...@normative.io <jo...@normative.io> #64
As far as I understand, co-hosting only supports randomly splitting the incoming traffic over the deployed models.
This does not support the most basic use case of hosting different models for different purposes on the same endpoint.
I am currently implementing multiple models in the same custom Docker image in order to get around this issue.
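Roughly, the custom image serves several models behind one prediction route and picks the model per request, something like this simplified sketch (model names/paths are placeholders; it follows Vertex AI's custom container contract via the AIP_* environment variables):

```python
import os

import joblib
from fastapi import FastAPI, Request

app = FastAPI()

# Load all models once at container start; names and paths are placeholders.
MODELS = {
    "model_a": joblib.load("/models/model_a.joblib"),
    "model_b": joblib.load("/models/model_b.joblib"),
}

# Vertex AI injects these routes into custom serving containers.
HEALTH_ROUTE = os.environ.get("AIP_HEALTH_ROUTE", "/health")
PREDICT_ROUTE = os.environ.get("AIP_PREDICT_ROUTE", "/predict")


@app.get(HEALTH_ROUTE)
def health():
    return {"status": "ok"}


@app.post(PREDICT_ROUTE)
async def predict(request: Request):
    body = await request.json()
    # Select the model via a request parameter instead of traffic splitting.
    name = body.get("parameters", {}).get("model", "model_a")
    predictions = MODELS[name].predict(body["instances"]).tolist()
    return {"predictions": predictions}

# Run with: uvicorn server:app --host 0.0.0.0 --port $AIP_HTTP_PORT
```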
bu...@gmail.com <bu...@gmail.com> #65
I love the easy incremental training of cloud models, but don't wish to pay hundreds of dollars a month for a rarely used endpoint that doesn't require low latency.
Any suggestions for alternate solutions?
le...@lsx.io <le...@lsx.io> #66
be...@teachfx.com <be...@teachfx.com> #67
fr...@legalhub.la <fr...@legalhub.la> #68
ro...@i-c.email <ro...@i-c.email> #69
e....@gmail.com <e....@gmail.com> #70
This is just not acceptable; few companies will be happy to throw money out of the window like that.
le...@psicokit.com <le...@psicokit.com> #71
[Deleted User] <[Deleted User]> #72
jo...@gmail.com <jo...@gmail.com> #73
gc...@apa.net <gc...@apa.net> #74
ar...@google.com <ar...@google.com> #75
Customer is choosing Cloud Run, now that it supports L4s, instead of Vertex inference, since Vertex can't scale to 0.
wj...@cwrk.ai <wj...@cwrk.ai> #76
ni...@gmail.com <ni...@gmail.com> #77
Please introduce this feature, as it is critical especially for prototyping solutions and testing them out at earlier stages of a product / technology.
Questions from @Mikhail Chrestkha:
(1) Are CPUs sufficient or are GPUs required?
(2) What is your cold start latency tolerance (i.e. once you're at zero and have a new request how long are you willing to wait for a response?)
(3) How much of your current pain point can be solved with co-hosting multiple models on the same endpoint (once this is rolled out for all frameworks)
(4) What alternative solution have you leveraged (i.e. Cloud Run for CPUs only, DIY on GCE, another PaaS/SaaS solution)?
Answers to @Mikhail:
1. GPU is necessary (especially since this service focuses on AI).
2. It's OK for the service to start up for some time. (Perhaps there can be a health check we can ping.) And perhaps there could be another parameter (e.g. period before scale-down: 5 minutes, 10 minutes, etc.).
3. Co-hosting would be nice, but does not solve the problem.
4. API-based services such as Cerebras, Groq, NVIDIA NIM. (Looking at Cloud Run, it seems we would only be able to use 1 GPU (L4 with 24GB); this is not sufficient for bigger models, e.g. 70B or 8x22B.)
What is the status & ETA of this ticket?
bd...@google.com <bd...@google.com> #78
Is there a timeline for the rollout of this feature?
so...@google.com <so...@google.com> #79
Is there a timeline for this? My customer is interested in this as we are migrating from AWS to GCP
an...@genairx.ai <an...@genairx.ai> #80
al...@sermescro.com <al...@sermescro.com> #81
The simple fact of this obvious "product regression" from the previous AI Platform is not just a technical/cost issue, but a product policy that is greatly discouraging for new users. This will definitely hurt adoption for lots of users (as it already did for us). Conversely, if it's not available now but is definitely on the roadmap, sharing that would be highly beneficial.
An entry barrier like this will quite often cause users to look for alternative products/ecosystems, as we're doing, especially bootstrapping users. You're actively driving users away to the competition.
Co-hosting is just a band-aid in such cases.
Description
What you would like to accomplish:
Customers are requesting that the minimum number of nodes be 0. Auto-scaling for Vertex AI online prediction currently requires at least 1 node; customers want it to be 0. The minimum node count used to be 0 in AI Platform.
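For reference, a sketch of how scaling is configured today with the Vertex AI Python SDK; min_replica_count must currently be at least 1, which is exactly the floor this request asks to remove (project and model IDs are placeholders):

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

model = aiplatform.Model("1234567890")  # placeholder model ID
endpoint = model.deploy(
    machine_type="n1-standard-2",
    min_replica_count=1,  # 0 is rejected today; this feature request asks for 0
    max_replica_count=3,  # autoscaling works above the floor, just not down to zero
)
```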