Vertex AI platform: Error Messages: Model server exited unexpectedly. [295518541]

Assigned

Bug

Status Update

No update yet.

Description

yo...@gmail.com

created issue #1

Aug 12, 2023 07:50PM

Hi,

I have multiple models that I want to deploy.

I created two resource pools, one for CPU and one for GPU.

I deployed two models to the CPU resource pool, and they work well.

But when I create an endpoint and attach it to the GPU resource pool, the deployment fails.

I tried two different models, and both fail the same way.

The same models work if I deploy them on dedicated resources with a GPU.

Here's the message I got by mail:

Error Messages: Model server exited unexpectedly.

When I click Create endpoint, it keeps loading for a few minutes and then the error appears.
I found this error in Logs Explorer:
(1) NOT_FOUND: Error executing an HTTP request: HTTP response code 404 with body '<?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Details>No such object: caip-tenant-fc9d9d0b-17f4-4284-9823-401faaf96ac0/5044324052747943936-processed/tfeieOptimizedModel/20230812093241/1/variables/variables.data-00000-of-00001</Details></Error>
when reading gs://caip-tenant-fc9d9d0b-17f4-4284-9823-401faaf96ac0/5044324052747943936-processed/tfeieOptimizedModel/20230812093241/1/variables/variables.data-00000-of-00001
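To check whether the object named in that 404 really is missing, the path can be split into a bucket and an object prefix and then listed (for example with `gsutil ls`). The following is only a minimal sketch: `split_gcs_uri` is a hypothetical helper, not part of any Google library, and it assumes the standard TensorFlow SavedModel layout, where the weights live under a `variables/` directory. The `caip-tenant-…` bucket looks service-managed (an assumption), so it may not be listable from a user project.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split a gs:// URI into (bucket, object_name). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Path copied from the NOT_FOUND log entry above.
uri = (
    "gs://caip-tenant-fc9d9d0b-17f4-4284-9823-401faaf96ac0/"
    "5044324052747943936-processed/tfeieOptimizedModel/20230812093241/1/"
    "variables/variables.data-00000-of-00001"
)

bucket, object_name = split_gcs_uri(uri)

# Assumption: SavedModel layout, so everything before "/variables/" is the
# versioned model directory that should also hold saved_model.pb.
model_dir = object_name.rsplit("/variables/", 1)[0]

print("bucket:   ", bucket)
print("object:   ", object_name)
print("model dir:", model_dir)
```

Listing `gs://<bucket>/<model dir>/` would then show whether the `variables/` shard was ever written.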

I also tried importing the model without the "Tensorflow optimize runtime" option, and I got this error:

P_REQUIRES failed at xla_ops.cc:296 : UNIMPLEMENTED: Could not find compiler for platform CUDA: NOT_FOUND: could not find registered compiler for platform CUDA -- was support for that platform linked in?"

Steps to reproduce:

1. Download this model to a bucket: https://tfhub.dev/google/sentence-t5/st5-11b/1
2. Create a resource pool with a V100 GPU and a standard machine type with 30 GB of memory.
3. Import the model into Model Registry, then create an endpoint that uses shared resources and points to the resource pool created above.
4. After a few minutes, the error appears.

Region: us-central1 (Iowa)

Comments

pu...@google.com <pu...@google.com> Aug 18, 2023 06:45AM

Assigned to pu...@google.com.

pu...@google.com <pu...@google.com> #2 Aug 21, 2023 01:21PM

The application crashes when the RecyclerView reaches the end (last page). Inside GithubRemoteMediator.kt I have set val endOfPaginationReached = page > 2 to allow a maximum of 2 pages for testing. So the problem is that when I reach the last page of the list, the application crashes because remoteKeys.nextKey is null (we set nextKey=null for the last page of the list).

Do you have a stack trace for the crash you could share?

Offline cache broken: how can caching be supported when the application is opened without an internet connection? Currently it shows a retry button. How can previously loaded data be shown? I have tried to fix this by removing

This might be due to loadStateFlow / listener logic on the PagingDataAdapter which hides / shows UI elements based on load state. You'll want to modify that logic to only listen to remote errors and not the local ones as well.

pu...@google.com <pu...@google.com> Aug 23, 2023 01:41PM

Reassigned to gc...@google.com.
