Unclarity in training pipeline [212696358]

Assigned

Bug

Status Update

No update yet.

Description

[Deleted User]

created issue #1

Dec 31, 2021 01:49PM

On this page (https://cloud.google.com/vertex-ai/docs/training/using-managed-datasets#access_a_dataset_from_your_training_application) it is mentioned that the environment variables AIP_DATA_FORMAT, AIP_TRAINING_DATA_URI, AIP_VALIDATION_DATA_URI and AIP_TEST_DATA_URI are available.

And below, the schema of the jsonl files is shown. However, it is not clear if the uri to the jsonl file or the contents are also available. At least not in one of those 4 variables as far as I know.

Therefore, it is very unclear if it is possible to extract the contents of for example the environment variable as given in the example below:

"annotationResourceLabels": {
"aiplatform.googleapis.com/annotation_set_name": "displayName",
"env": "prod"
}
},

Comments

ew...@google.com <ew...@google.com> #2 Jan 3, 2022 02:14PM

Accepted by ew...@google.com.

Hi!

Thank you for reaching out!

I see you are having trouble understanding the public documentation for AI Platform. I will try to answer your question:

First, regarding the environment variables, you will have to provide them always, and they are going to be used by your model to locate your dataset. These variables will hold the path to the jsonl files or BigQuery tables:

If the AIP_DATA_FORMAT of your dataset is jsonl or csv, the data URI values refer to Cloud Storage URIs, like gs://bucket_name/path/training-*

For non text data:

"env":[
   {
      "name":"AIP_VALIDATION_DATA_URI",
      "value":"gs://bucket/validation.jsonl"
   },
   {
      "name":"AIP_TEST_DATA_URI",
      "value":"gs://bucket/test.jsonl"
   },
   {
      "name":"AIP_DATA_FORMAT",
      "value":"jsonl"
   },
   {
      "name":"AIP_TRAINING_DATA_URI",
      "value":"gs://bucket/training.jsonl"
   }
]
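As a minimal sketch of how a training application might pick these values up, the snippet below reads the four variables from the process environment. The helper name read_dataset_metadata is hypothetical, and returning None for a missing variable is an assumption made for illustration, not documented Vertex AI behaviour.

```python
import os
from typing import Dict, Mapping, Optional

# Variable names as listed in the documentation quoted in this thread.
AIP_VARIABLES = (
    "AIP_DATA_FORMAT",
    "AIP_TRAINING_DATA_URI",
    "AIP_VALIDATION_DATA_URI",
    "AIP_TEST_DATA_URI",
)


def read_dataset_metadata(
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Optional[str]]:
    """Return the AIP_* values, with None for any variable that is not set.

    `environ` defaults to os.environ; passing a plain dict makes the
    function easy to try outside a training container.
    """
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in AIP_VARIABLES}
```

Inside the container, calling read_dataset_metadata() with no arguments would return whatever the job has set; outside it, every value would simply be None.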

"However, it is not clear if the uri to the jsonl file or the contents are also available. At least not in one of those 4 variables as far as I know."

There are 2 types of data you can ingest into your model: text data and non-text data. For text data, you can use csv files or the URI of a BigQuery table. If you are using non-textual data, like images, you have to provide the URI of the jsonl file that describes your dataset and labels, following the pattern presented in the documentation.

"Therefore, it is very unclear if it is possible to extract the contents of for example the environment variable as given in the example below"

The environment variables are used by the container/model to know where to find the dataset. The data inside the jsonl files is your dataset; it provides, for example, the GCS URI of each image in your dataset. So, no: the environment variables serve another purpose (described above), and your jsonl dataset should hold the URIs of your data.
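To illustrate the point that each jsonl line carries the image URI and its label, here is a rough sketch that parses one line. The function name parse_image_record is hypothetical. It accepts both the singular classificationAnnotation key from the documented schema and the plural classificationAnnotations list that appears in the exported files quoted later in this thread; treating a record without either key as unlabeled is an assumption.

```python
import json
from typing import List, Tuple


def parse_image_record(line: str) -> Tuple[str, List[str]]:
    """Extract the image URI and label names from one jsonl line."""
    record = json.loads(line)
    uri = record["imageGcsUri"]

    # Exported files in this thread use a list under the plural key;
    # the documented schema shows a single object under the singular key.
    annotations = record.get("classificationAnnotations")
    if annotations is None:
        single = record.get("classificationAnnotation")
        annotations = [single] if single is not None else []

    labels = [annotation["displayName"] for annotation in annotations]
    return uri, labels
```

The returned URI still has to be downloaded or streamed by the training code; this sketch only shows where the location and label live in each record.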

Please let me know if this is clear to you. If not, please point out which part is unclear, whether you want to proceed with a request to change the documentation, and more details of the required change.

Kind Regards

[Deleted User] <[Deleted User]> #3 Jan 3, 2022 03:08PM

Hi, thank you for your response

I'm not sure I understand your response fully. I'll try to elaborate a bit more.

When a managed dataset is used to create a custom training job, I assume the 4 variables AIP_DATA_FORMAT, AIP_TRAINING_DATA_URI, AIP_VALIDATION_DATA_URI and AIP_TEST_DATA_URI will be available in the container of the custom training job. This is how it is mentioned in the docs:

At runtime, Vertex AI passes metadata about your dataset to your training application by setting the following environment variables in your training container.

Your response mentions "regarding the environment variables, you will have to provide them always". I do not believe this is correct. These environment variables are set when the dataset is assigned; users don't need to set them.

That being said, my first question was about this explanation:

AIP_DATA_FORMAT: The format that your dataset is exported in. Possible values include: jsonl, csv, or bigquery.

If I interpret this literally, that means that AIP_DATA_FORMAT is either jsonl or csv. This seems to be correct from your response.

My other question was about the other 3 variables. How I interpret the docs:

AIP_TRAINING_DATA_URI: The location that your training data is stored at.

I would assume that this will be, for example gs://bucket_name/path/training-*, with the bucket containing for example gs://bucket_name/path/training-image_1.jpg, gs://bucket_name/path/training-image_2.jpg, etc. Because the docs mention "location that your training data is stored at".

However, from your response I now interpret that AIP_TRAINING_DATA_URI would be gs://bucket/training.jsonl and that this file contains the locations of the training data.

Then my suggestion for the docs would be:

AIP_TRAINING_DATA_URI: link to a file that contains all training data information.

I would then give an example of what the file behind AIP_TRAINING_DATA_URI could look like, because the page currently shows:

{
  "imageGcsUri": "gs://bucket/filename.ext",
  "classificationAnnotation": {
    "displayName": "LABEL",
    "annotationResourceLabels": {
      "aiplatform.googleapis.com/annotation_set_name": "displayName",
      "env": "prod"
    }
  },
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training/test/validation"
  }
}

but this mixes training/testing and validation. In your response, I'm assuming that the variables AIP_TRAINING_DATA_URI, AIP_VALIDATION_DATA_URI and AIP_TEST_DATA_URI will have split them up!

Thanks in advance.

If my explanation is not clear: I'm always available for a video chat to elaborate :-).

Cheers

Message last modified on Jan 3, 2022 03:09PM

ew...@google.com <ew...@google.com> #4 Jan 5, 2022 12:27PM

Assigned to gc...@google.com.

Hi!

Thank you for the provided information.

I will open an internal request regarding this issue and pass your suggestions on to the product team. Please keep in mind that the request has to be analyzed and considered by the product team, and I can't provide an ETA for delivery. However, you can keep track of the status by following this thread.

Kind regards

[Deleted User] <[Deleted User]> #5 Jan 7, 2022 12:36PM

An update here. Digging in a bit more. The environmental variables indeed refer to training*.jsonl files that are stored in the staging bucket of the training pipeline.
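Since the variables hold wildcard patterns such as gs://bucket_name/path/training-* rather than a single file name, the training code has to resolve the pattern itself. Below is a rough sketch of that matching step using only the standard library. It assumes the object names have already been listed by some Cloud Storage client (not shown), the helper names are hypothetical, and fnmatch-style matching is a stand-in for whatever wildcard semantics the platform actually applies.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/pattern' into ('bucket', 'some/pattern')."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    bucket, _, object_pattern = uri[len(prefix):].partition("/")
    return bucket, object_pattern


def matching_objects(uri_pattern: str, object_names: Iterable[str]) -> List[str]:
    """Keep the object names that match the wildcard part of the URI.

    Note: with fnmatch, '*' also matches '/' characters.
    """
    _, pattern = split_gcs_uri(uri_pattern)
    return [name for name in object_names if fnmatchcase(name, pattern)]
```

Each matched object would then be read line by line as jsonl.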

Unfortunately, the behaviour is not as expected. I have successfully created a training pipeline linked to a dataset that was labeled and divided into a training, testing and validation set. However, when looking at the contents of the jsonl files, I noticed that this split is not respected. An example snippet of the validation*.jsonl:

{"imageGcsUri":"gs://vertex-ai-book-rps-images/scissors/1546002498.1090221.jpg","classificationAnnotations":[{"displayName":"scissors","annotationResourceLabels":{"aiplatform.googleapis.com/annotation_set_name":"225610989926612992"}}],"dataItemResourceLabels":{"aiplatform.googleapis.com/ml_use":"training"}}
{"imageGcsUri":"gs://vertex-ai-book-rps-images/scissors/1546002499.7151406.jpg","classificationAnnotations":[{"displayName":"scissors","annotationResourceLabels":{"aiplatform.googleapis.com/annotation_set_name":"225610989926612992"}}],"dataItemResourceLabels":{"aiplatform.googleapis.com/ml_use":"validation"}}
{"imageGcsUri":"gs://vertex-ai-book-rps-images/scissors/1546002491.9542408.jpg","classificationAnnotations":[{"displayName":"scissors","annotationResourceLabels":{"aiplatform.googleapis.com/annotation_set_name":"225610989926612992"}}],"dataItemResourceLabels":{"aiplatform.googleapis.com/ml_use":"training"}}
{"imageGcsUri":"gs://vertex-ai-book-rps-images/paper/1546002486.7451549.jpg","classificationAnnotations":[{"displayName":"paper","annotationResourceLabels":{"aiplatform.googleapis.com/annotation_set_name":"225610989926612992"}}],"dataItemResourceLabels":{"aiplatform.googleapis.com/ml_use":"training"}}
{"imageGcsUri":"gs://vertex-ai-book-rps-images/paper/1546002486.9894636.jpg","classificationAnnotations":[{"displayName":"paper","annotationResourceLabels":{"aiplatform.googleapis.com/annotation_set_name":"225610989926612992"}}],"dataItemResourceLabels":{"aiplatform.googleapis.com/ml_use":"training"}}
{"imageGcsUri":"gs://vertex-ai-book-rps-images/rock/1546002733.7857163.jpg","classificationAnnotations":[{"displayName":"rock","annotationResourceLabels":{"aiplatform.googleapis.com/annotation_set_name":"225610989926612992"}}],"dataItemResourceLabels":{"aiplatform.googleapis.com/ml_use":"test"}}
{"imageGcsUri":"gs://vertex-ai-book-rps-images/rock/1546002476.2734113.jpg","classificationAnnotations":[{"displayName":"rock","annotationResourceLabels":{"aiplatform.googleapis.com/annotation_set_name":"225610989926612992"}}],"dataItemResourceLabels":{"aiplatform.googleapis.com/ml_use":"training"}}
{"imageGcsUri":"gs://vertex-ai-book-rps-images/paper/1546002480.9602935.jpg","classificationAnnotations":[{"displayName":"paper","annotationResourceLabels":{"aiplatform.googleapis.com/annotation_set_name":"225610989926612992"}}],"dataItemResourceLabels":{"aiplatform.googleapis.com/ml_use":"training"}}
{"imageGcsUri":"gs://vertex-ai-book-rps-images/paper/1546002481.4460943.jpg","classificationAnnotations":[{"displayName":"paper","annotationResourceLabels":{"aiplatform.googleapis.com/annotation_set_name":"225610989926612992"}}],"dataItemResourceLabels":{"aiplatform.googleapis.com/ml_use":"training"}}

One would expect that the training/validation and test splits of the datasets would be used and not have them shuffled again.
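As a possible workaround, sketched here under the assumption that every exported record keeps its aiplatform.googleapis.com/ml_use label as in the snippet above, the training code could regroup the records by that label itself. The function name group_by_ml_use and the "unassigned" bucket for records without a label are hypothetical choices for illustration.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List

ML_USE_KEY = "aiplatform.googleapis.com/ml_use"


def group_by_ml_use(lines: Iterable[str]) -> Dict[str, List[dict]]:
    """Group jsonl records by the ml_use label in dataItemResourceLabels."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        labels = record.get("dataItemResourceLabels", {})
        groups[labels.get(ML_USE_KEY, "unassigned")].append(record)
    return dict(groups)
```

Feeding the lines of all exported files through this function would yield one list per original split, regardless of which file a record ended up in.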

I also assume that the following parameter would help, but it is not clear how to use it: https://github.com/googleapis/python-aiplatform/blob/b72067bd6b7013fbd3a00f9178644320300f96c0/google/cloud/aiplatform/training_jobs.py#L1792
