Assigned
Comments
ds...@google.com <ds...@google.com> #2
I have forwarded this request to the engineering team. We will update this issue with any progress updates and a resolution.
Best Regards,
Josh Moyer
Google Cloud Platform Support
Description
(I'm reposting this on the public tracker since we had no reaction to the post on the confidential one)
Problem you have encountered: Ever since the OCR model was updated, we have noticed a regression that causes problems with our application's text processing.
Consider the attached image (it's a sample specimen of the Dutch passport). When running an image annotation request with the TEXT_DETECTION feature, using the old model (builtin/legacy), we get the appropriate response (old_response.json). The first entity annotation of the response contains the full text of the image. You can find a fragment in the full text which says:
\nP NLD Nederlandse-\n
. When looking through the individual text annotations, I can find the matching annotation, which contains the description "Nederlandse-" (notice the dash character).
After the model update, when using the latest model (new_response.json), the same request has a response where the individual annotations do not add up to the full text annotation: in the full text annotation I can still see the
\nP NLD Nederlandse-\n
(with the dash character) fragment, but when looking through the individual text annotations, the matching annotation is missing the dash character, and the dash is not found in any other annotation, so it's completely gone.
What you expected to happen:
I expect that image annotation responses with the TEXT_DETECTION or DOCUMENT_TEXT_DETECTION features should have the first annotation element contain the full text of the image, and that the following annotation elements should completely match the full text when added up.
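To make the expectation concrete, here is a minimal sketch of the consistency check we have in mind. It assumes the annotations have already been loaded from the response JSON as a list of dicts that each carry a `description` field, with the first element holding the full text. The function name `find_missing_characters` is hypothetical and not part of any client library, and the greedy left-to-right matching is a simplification that may not hold for every layout.

```python
def find_missing_characters(annotations):
    """Return (index, char) pairs of full-text characters that no
    individual annotation accounts for.

    Assumption: annotations[0]["description"] is the full text and the
    remaining elements are the individual words, in reading order.
    """
    if not annotations:
        return []

    full_text = annotations[0]["description"]
    missing = []
    pos = 0  # position in full_text up to which everything is accounted for

    for annotation in annotations[1:]:
        word = annotation["description"]
        idx = full_text.find(word, pos)
        if idx == -1:
            # The word does not occur (in order) in the full text; skip it.
            continue
        # Any non-whitespace character between pos and idx is uncovered.
        for i in range(pos, idx):
            if not full_text[i].isspace():
                missing.append((i, full_text[i]))
        pos = idx + len(word)

    # Check whatever remains after the last matched word.
    for i in range(pos, len(full_text)):
        if not full_text[i].isspace():
            missing.append((i, full_text[i]))

    return missing


# Illustrative data modelled on the fragment quoted above (not the real files).
old_style = [
    {"description": "P NLD Nederlandse-\n"},
    {"description": "P"},
    {"description": "NLD"},
    {"description": "Nederlandse-"},
]
new_style = [
    {"description": "P NLD Nederlandse-\n"},
    {"description": "P"},
    {"description": "NLD"},
    {"description": "Nederlandse"},
]

print(find_missing_characters(old_style))  # no uncovered characters
print(find_missing_characters(new_style))  # the trailing dash is uncovered
```

With the old-style data the check reports nothing, while with the new-style data it reports the dash that appears in the full text but in none of the individual annotations.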
Steps to reproduce:
Alternatively
I'd like to add that this issue is currently happening with quite a lot of files (a dash before a new line is present in the full text, but stripped from the individual element annotations), but this is the only non-sensitive file I have found so far.