Assigned
Comments
be...@gmail.com <be...@gmail.com> #3
Hi, this error only affects Cloud SDK version 186. It was previously reported in Issue 72407295 and a fix for it should be released in Cloud SDK version 187.
In the meantime, you can downgrade to Cloud SDK version 185 as a workaround by running the following command:
gcloud components update --version 185.0.0
Description
Couldn't open CUDA library libcupti.so.9.0. LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nvidia/lib64
and
Non-OK-status: status_ status: Failed precondition: could not dlopen DSO: libcupti.so.9.0; dlerror: libcupti.so.9.0: cannot open shared object file: No such file or directory
My ml-engine job configuration:
trainingInput:
  region: us-east1
  pythonVersion: '3.5'
  runtimeVersion: '1.8'
  scaleTier: CUSTOM
  masterType: standard_gpu
Without the TensorFlow profiler the training script runs fine; with the profiler enabled, the above errors occur. libcupti.so.9.0 does exist in /usr/local/cuda/extras/CUPTI/lib64, but LD_LIBRARY_PATH currently does not include this path.
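For anyone who wants to confirm the same condition on their own worker, the check described above can be sketched roughly as follows. The directory and library name are the ones reported in this issue; the helper names (`split_search_path`, `find_library`, `diagnose`) are made up for illustration and are not part of TensorFlow or ML Engine.

```python
import os

# Values taken from this issue report; adjust for other CUDA versions.
CUPTI_DIR = "/usr/local/cuda/extras/CUPTI/lib64"
CUPTI_LIB = "libcupti.so.9.0"


def split_search_path(ld_library_path):
    """Return the non-empty directories listed in an LD_LIBRARY_PATH string."""
    return [d for d in ld_library_path.split(":") if d]


def find_library(lib_name, ld_library_path):
    """Return the first listed directory that contains lib_name, or None."""
    for directory in split_search_path(ld_library_path):
        if os.path.exists(os.path.join(directory, lib_name)):
            return directory
    return None


def diagnose(ld_library_path, cupti_dir=CUPTI_DIR, lib_name=CUPTI_LIB):
    """Report whether the CUPTI directory is searched and whether the file is there."""
    return {
        "cupti_dir_on_path": cupti_dir in split_search_path(ld_library_path),
        "lib_exists_in_cupti_dir": os.path.exists(os.path.join(cupti_dir, lib_name)),
        "found_on_path_in": find_library(lib_name, ld_library_path),
    }


if __name__ == "__main__":
    # Hypothetical usage: call this early in the training script and log the result.
    print(diagnose(os.environ.get("LD_LIBRARY_PATH", "")))
```

On a worker showing this problem, one would expect `cupti_dir_on_path` to be false while `lib_exists_in_cupti_dir` is true. This only inspects LD_LIBRARY_PATH and ignores other loader search locations such as the ldconfig cache.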
The current workaround is to allocate a similar VM and profile there. It would be better to profile in the same ML Engine environment that I plan to use for training, without having to set up environments manually.
Request: please add /usr/local/cuda/extras/CUPTI/lib64 to LD_LIBRARY_PATH in the ML Engine runtime.
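Until the runtime is changed, one possible user-side approach is to relaunch the training code in a child process whose LD_LIBRARY_PATH includes the CUPTI directory. The dynamic loader reads this variable when a process starts, so modifying `os.environ` inside an already running Python process is generally not enough. The sketch below is untested on ML Engine; `extend_library_path` and `run_with_cupti` are illustrative names, and `trainer.task` in the usage comment is a hypothetical module.

```python
import os
import subprocess
import sys

# Directory reported in this issue as containing libcupti.so.9.0.
CUPTI_DIR = "/usr/local/cuda/extras/CUPTI/lib64"


def extend_library_path(current, extra_dir=CUPTI_DIR):
    """Return `current` with extra_dir appended, unless it is already listed."""
    parts = [p for p in current.split(":") if p]
    if extra_dir not in parts:
        parts.append(extra_dir)
    return ":".join(parts)


def run_with_cupti(argv, env=None):
    """Run argv in a child process whose LD_LIBRARY_PATH includes CUPTI_DIR.

    Returns the child's exit code.
    """
    child_env = dict(os.environ if env is None else env)
    child_env["LD_LIBRARY_PATH"] = extend_library_path(
        child_env.get("LD_LIBRARY_PATH", "")
    )
    return subprocess.call(argv, env=child_env)


# Hypothetical usage from a thin launcher module:
#   sys.exit(run_with_cupti([sys.executable, "-m", "trainer.task"] + sys.argv[1:]))
```

Appending rather than prepending keeps the runtime's existing search order intact, so the CUPTI directory is only consulted for libraries not found elsewhere.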