Comments
jo...@google.com <jo...@google.com> #2
Thank you for bringing this to our attention. I'm running some tests to reproduce the issue and will update you next Tuesday (2021-05-25) before 17:00 CDT with my findings.
jo...@google.com <jo...@google.com> #3
Hi, I need your help to clarify the issue.
I was following this
JOB_NAME="my_first_keras_job"
JOB_DIR="gs://$BUCKET_NAME/keras-job-dir"
Please note that both of the jobs I submitted use the same variables, and I found no issues. If possible, could you please share the error message you are getting?
[Deleted User] <[Deleted User]> #4
Hi! Of course. Here's how to reproduce this issue:
Create a directory, let's call it ai-bug-example. touch a __init__.py and create a file called run.py containing the following:
import sys
print(sys.argv)
Now run this locally:
$ gcloud ai-platform local train --job-dir gs://my-gs-bucket --package-path ai-bug-example --module-name ai-bug-example.run
['/home/kamil/gcloud-issue/ai-bug-example/run.py', '--job-dir', 'gs://my-gs-bucket']
As you can see, the job dir is passed as is.
Now let's submit the same job:
$ gcloud ai-platform jobs submit training test --region europe-west4 --job-dir gs://my-gs-bucket --package-path ai-bug-example --module-name ai-bug-example.run --python-version 3.7 --runtime-version 1.15
and check the Log Explorer:
master-replica-0
"Running command: python3 -m ai-bug-example.run --job-dir gs://my-gs-bucket/"
master-replica-0
"['/root/.local/lib/python3.7/site-packages/ai-bug-example/run.py', '--job-dir', 'gs://my-gs-bucket/']"
Now an additional slash has appeared at the end of the gs:// URI, which would not be an issue were it not that /some/file and some/file are different paths on GCS.
I can see the reasoning for it being the entire directory including the trailing slash, but I also believe the behaviour of local train and submit training should be the same, so one of them seems wrong.
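Until the two commands agree, one possible workaround is to normalize the value inside the training script. The following is only a sketch, not an official fix: the helper name normalize_job_dir and the file name model.ckpt are hypothetical, and it assumes the script parses --job-dir with argparse.

```python
import argparse


def normalize_job_dir(job_dir):
    """Strip trailing slashes so both forms compare and join identically."""
    return job_dir.rstrip("/")


def parse_args(argv):
    parser = argparse.ArgumentParser()
    # The flag arrives as --job-dir; argparse exposes it as args.job_dir.
    parser.add_argument("--job-dir", type=normalize_job_dir, required=True)
    # Ignore any other arguments the training script may receive.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Value as passed by "local train" and as passed by "jobs submit training".
local = parse_args(["--job-dir", "gs://my-gs-bucket"])
cloud = parse_args(["--job-dir", "gs://my-gs-bucket/"])
assert local.job_dir == cloud.job_dir == "gs://my-gs-bucket"

# Joining with an explicit separator avoids an accidental double slash.
print(local.job_dir + "/" + "model.ckpt")
```

With this in place the script sees the same job dir in both environments, whichever form gcloud passes.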
jo...@google.com <jo...@google.com> #5
Thank you for the information provided.
The Engineering team is aware of the issue and will provide an update in the next comments. If you have any additional comments or questions, please feel free to add them and I will be happy to assist you.
Description
Problem you have encountered:
I'm trying to run a custom training job on the AI Platform. I've tested my model locally, launching it using gcloud, but when submitting the job, the job dir parameter is modified.
I ran it locally using a command like the following:
gcloud ai-platform local train --job-dir gs://some-bucket [etc etc]
This worked fine - my Python script was run with the args, parsed using argparse, with "job_dir" being "gs://some-bucket".
Then I decided to submit this and train in the cloud:
gcloud ai-platform jobs submit training some_job_name --job-dir gs://some-bucket [etc etc]
What happened:
The job_dir argument got set to "gs://some-bucket/" (note the trailing slash). My code, trying to extract the bucket name, extracted "some-bucket/", and the cloud storage Python module failed as bucket names must start and end with alphanumeric characters.
What you expected to happen:
The job_dir argument provided to my training script is the same as when run locally. Either both should add a trailing slash, or neither should.
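For reference, a minimal sketch of bucket-name extraction that tolerates both forms, assuming the job dir is always a gs:// URI. The function name split_gcs_uri is hypothetical and this is not the code from my training script; it only uses the standard library's urllib.parse.

```python
from urllib.parse import urlparse


def split_gcs_uri(uri):
    """Return (bucket, object_prefix) for a gs:// URI, ignoring trailing slashes."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URI: {uri!r}")
    # netloc is the bucket name; the path carries any object prefix.
    return parsed.netloc, parsed.path.strip("/")


# Both spellings of the job dir yield the same bucket name.
print(split_gcs_uri("gs://some-bucket"))
print(split_gcs_uri("gs://some-bucket/"))
```

Parsing the URI rather than slicing the string keeps the bucket name free of the trailing slash whichever way the argument arrives.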