Tags are being treated as full stops, producing fragmented translations [364535683]

Assigned

Feature Request

Status Update

No update yet.

Description

ch...@google.com

created issue #1

Sep 4, 2024 10:30AM

What you would like to accomplish:
I would like to suggest the addition of support for XML/XLIFF tags in Google AutoML custom models. Given that XML/XLIFF is the standard format widely used in the localization industry, integrating this capability would significantly enhance the tool's utility, similar to what is offered by DeepL. This feature would greatly benefit users by ensuring seamless processing and translation of these industry-standard formats.

How this might work:
The engine should handle the tags correctly, as the DeepL engine does.

If applicable, reasons why alternative solutions are not sufficient:
XLIFF placeholders in the input text are not handled well by either Custom Translation or the Translation API. The closest I can get is HTML translation with the XLIFF placeholders left in the text.

Other information (workarounds you have tried, documentation consulted, etc):

The suggested workaround was to set the mime_type in the TranslateText request to "text/html". This is supposed to keep the translation from being fragmented (i.e. the text is processed at sentence level) and to keep the tags ordered as part of the sentence. The user's input is translated as shown below.

The Report <x id="1"/> object links are not valid because field <x id="2"/> on <x id="3"/> conflicts with <x id="4"/> on <x id="5"/>
报告 <x id="1"/> 对象链接无效，因为 <x id="3"/> 上的字段 <x id="2"/> 与 <x id="5"/> 上的 <x id="4"/> 冲突

This should have matched the expected output from DeepL. However, the result could not be reproduced.
The test used the curl command with a request.json based on the following document, with the mime_type in the TranslateText request set to "text/html" as suggested:

https://cloud.google.com/translate/docs/advanced/translating-text-v3#curl

Even with the mime_type set to "text/html" in the request.json file, the returned translation was still fragmented and the tags were not reordered:

报表<x id="1"/>对象链接无效，因为字段<x id="2"/>“”在“”上<x id="3"/>“”与“”冲突<x id="4"/>“”在“”上<x id="5"/>
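
The difference between the two translations can be checked mechanically by comparing the order of the placeholder ids. The following is only a minimal sketch: the helper name placeholder_order is hypothetical, and the three strings are copied from this issue.

```python
import re

# Matches the self-closing XLIFF placeholders used in this issue, e.g. <x id="3"/>
X_TAG = re.compile(r'<x id="(\d+)"/>')


def placeholder_order(text):
    """Return the placeholder ids in the order they appear in the text."""
    return [int(tag_id) for tag_id in X_TAG.findall(text)]


source = ('The Report <x id="1"/> object links are not valid because field '
          '<x id="2"/> on <x id="3"/> conflicts with <x id="4"/> on <x id="5"/>')
expected = ('报告 <x id="1"/> 对象链接无效，因为 <x id="3"/> 上的字段 <x id="2"/> '
            '与 <x id="5"/> 上的 <x id="4"/> 冲突')
observed = ('报表<x id="1"/>对象链接无效，因为字段<x id="2"/>“”在“”上<x id="3"/>'
            '“”与“”冲突<x id="4"/>“”在“”上<x id="5"/>')

print(placeholder_order(source))    # [1, 2, 3, 4, 5]
print(placeholder_order(expected))  # [1, 3, 2, 5, 4]
print(placeholder_order(observed))  # [1, 2, 3, 4, 5]
```

In the expected translation the tags are reordered to follow Chinese word order, while the observed output keeps the source order, which is consistent with each tag being treated as a segment boundary.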

Other information taken into consideration:

> Made sure that only plain-text content is included in the training of the models, and removed all HTML tags from the examples provided.
> For better performance, tried to ensure that the training examples are full sentences.
> Currently the HTML handling in Custom Translation is being upgraded to match the general NMT models and it's expected to have improved accuracy.

The other workaround is to use a {number} or similar <number> format, which ensures that each placeholder is tokenized as one word and not broken apart. This is expected to give better quality in general than XML tags. Adding characters within the <x/> nodes was also tried (e.g. something like <x id="{number}">{can be any single character}</x>).
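
The {number} workaround described above could be applied as a pre- and post-processing step around the translation call. This is a rough sketch only: the function names mask, unmask and pad are hypothetical, the translation call itself is omitted, and nothing here guarantees that the engine leaves the {number} tokens intact.

```python
import re

# Self-closing XLIFF placeholder as used in this issue, e.g. <x id="3"/>
X_TAG = re.compile(r'<x id="(\d+)"/>')
# Brace token that stands in for a placeholder during translation, e.g. {3}
BRACE = re.compile(r'\{(\d+)\}')


def mask(text):
    """Replace each <x id="N"/> with {N} before sending the text for translation."""
    return X_TAG.sub(lambda m: "{" + m.group(1) + "}", text)


def unmask(text):
    """Restore each {N} to <x id="N"/> in the translated text."""
    return BRACE.sub(lambda m: '<x id="' + m.group(1) + '"/>', text)


def pad(text, filler="x"):
    """Variant: give each <x/> node a single-character body, <x id="N">x</x>."""
    return X_TAG.sub(lambda m: '<x id="' + m.group(1) + '">' + filler + '</x>', text)


segment = 'field <x id="2"/> on <x id="3"/>'
print(mask(segment))          # field {2} on {3}
print(unmask(mask(segment)))  # field <x id="2"/> on <x id="3"/>
print(pad(segment))           # field <x id="2">x</x> on <x id="3">x</x>
```

Note that unmask would also rewrite any literal {number} text that was already in the source, so this sketch assumes the input contains no such text.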

The request was run on the "general NMT" (i.e. non-customized) models with the HTML MIME type. However, the results are not always deterministic, so different results are expected even with the same model.
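
For reference, a request.json body of the kind described here could be assembled as follows. This is a sketch under assumptions, not the exact file used in the test: the project id "my-project", the "global" location and the en to zh-CN language pair are placeholders, and the field names follow the JSON (camelCase) form of the v3 translateText request, where mime_type appears as mimeType.

```python
import json


def build_request_body(text, project_id, source_lang="en", target_lang="zh-CN",
                       location="global"):
    """Build a translateText request body that uses the general NMT model and HTML MIME type."""
    parent = "projects/" + project_id + "/locations/" + location
    return {
        "contents": [text],
        "mimeType": "text/html",  # treat <x .../> as markup, not plain text
        "sourceLanguageCode": source_lang,
        "targetLanguageCode": target_lang,
        "model": parent + "/models/general/nmt",  # non-customized model
    }


body = build_request_body(
    'The Report <x id="1"/> object links are not valid',
    project_id="my-project",  # hypothetical project id
)
# Print the JSON that would be saved as request.json and passed to curl
print(json.dumps(body, ensure_ascii=False, indent=2))
```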

These are best-effort workarounds and the API does not officially support placeholders.
