WAI
Status Update
Comments
ja...@ambiata.com <ja...@ambiata.com> #2
I agree: it would be good if each output response could at least be linked to an input file, or, yes, returned in the original order.
yb...@tangerine.ca <yb...@tangerine.ca> #3
This is very important to resolve; otherwise, what's the use of the batch prediction job if we can't guarantee the order of the probability scores for each input line?
ma...@telus.com <ma...@telus.com> #4
I'd like to add another case with a similar ask:
Google Cloud Support 42224058: We are not able to use the Batch Predictions Vertex Pipeline component because the predictions are not labelled.
ra...@gmail.com <ra...@gmail.com> #5
sh...@google.com <sh...@google.com> #6
We internally use some form of MapReduce framework to run the batch prediction, and the data is shuffled across multiple workers, so by nature the results are not sorted. The way to get the results in some order is mentioned in
sh...@google.com <sh...@google.com>
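For illustration only, a minimal sketch of one way to make results identifiable despite the shuffling: attach an explicit identifier to each JSONL instance before submitting the job, so the echoed "instance" in each output line identifies the original row. The field name row_id and the file names here are hypothetical, and a model that strictly validates its input schema may reject the extra field.

# Sketch: tag each JSONL instance with a row id before batch prediction.
# "row_id" is a hypothetical field name; check that your model tolerates
# (or passes through) the extra field.
import json

with open("input.jsonl") as src, open("input_with_ids.jsonl", "w") as dst:
    for i, line in enumerate(src):
        instance = json.loads(line)
        instance["row_id"] = i  # stable identifier for joining results later
        dst.write(json.dumps(instance) + "\n")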
gr...@bebr.nl <gr...@bebr.nl> #7
So just to confirm: until this patch reaches General Availability (ETA 3-6 months?), there is no built-in way to use ModelBatchPredictOp or BatchPredictionJob for batch predictions in a production environment? Because if the predictions cannot be linked back to the thing they are predicting for, then there isn't much point in making them!
One potential workaround I can see is to make a copy of the prediction features in your pipeline, with the IDs attached, before they go into the ModelBatchPredictOp, then take the instances+predictions output of ModelBatchPredictOp and join the two together using the features as merge keys (see the sketch below). But this seems pretty fraught with potential errors, so I'm curious whether there is a better way. With a custom model you could probably handle this in the predictor, I guess?
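A minimal sketch of that join-back idea, assuming JSONL batch output where each line carries an "instance" (a dict of named features) and a "prediction" field; the file names and feature column names below are hypothetical placeholders, not part of any official API:

# Sketch: join batch prediction output back to the pre-prediction copy of the
# inputs, using the feature values themselves as merge keys.
# File names and FEATURE_COLS are hypothetical; adjust to your pipeline.
import json
import pandas as pd

FEATURE_COLS = ["feature_a", "feature_b"]  # hypothetical feature names

def load_predictions(path: str) -> pd.DataFrame:
    rows = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            row = dict(record["instance"])                         # echoed input features
            row["prediction"] = json.dumps(record["prediction"])   # model output
            rows.append(row)
    return pd.DataFrame(rows)

inputs = pd.read_csv("inputs_with_ids.csv")  # copy made before ModelBatchPredictOp
preds = load_predictions("prediction.results-00000-of-00001")

# Join on the feature values; fragile if two rows share identical feature values.
merged = inputs.merge(preds, on=FEATURE_COLS, how="left")
merged.to_csv("predictions_with_ids.csv", index=False)

Note that this join degrades if feature values are not unique per row, which is one reason the passthrough/labelled-output feature requested here would be preferable.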
Finally, this patch says it makes these passthrough arguments available on BatchPredictionJob (which is great!), but will they also be added to ModelBatchPredictOp?
Description
Please describe your requested enhancement. Good feature requests will solve common problems or enable new use cases.
What you would like to accomplish:
Currently, Vertex AI Batch Predictions does not guarantee the order of the output data.
This is not documented as a limitation of Vertex AI, but the AI Platform documentation[1] suggests this is WAI (working as intended), and I can observe the same behaviour with Vertex AI.
It would be more useful if the output files were generated in the same order as the input file.
[1]