Limit OCR character set [180413376]

Assigned

Feature Request

Status Update

No update yet.

Description

ju...@gmail.com

created issue #1

Feb 17, 2021 03:13PM

Please add a feature where we can specify characters to be searched in a document.

The issue:

I'm working with low-quality documents that can only contain alphanumeric characters [A-Za-z0-9] and a few known symbols [/,'<-]. OCR sometimes misclassifies these characters as illegal ones, for example:

S becomes one of Š$ⱾŞꞨ...; < becomes one of く^人>ぐ《«众...

This cannot be easily reversed. For example, I know my document doesn't contain "!", but I get the OCR result "! P!!!ow". I can't reliably correct it to "1 Pillow", because "!" may stand for "1", "i", or "l" depending on position. Perhaps the confidence score for the first character was .7 for "!" and .69 for "1", so OCR returned "!". If I could specify which characters to look for in the image, it could increase accuracy and reduce the OCR post-processing required.

What you would like to accomplish:

I would like an option to make OCR look only for the characters I provide in a list.

How this might work (2 possible ways):

1. When submitting an OCR request, specify a character set, the same way a language is specified today. OCR would only return characters from that set, even if other characters had a higher confidence score.
2. Alternatively, return every candidate prediction with a score above 0.1 for each character, so we can pick the characters we need ourselves. This would be enabled with a flag on the request, the same way a language is specified today. Example return value: "confidence": {"!": .7, "1": 0.69, "j": 0.3}.
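
Both options come down to restricting the final choice to an allowed set. Below is a minimal Python sketch of the client-side selection for the second option. It assumes a hypothetical response that exposes per-character candidate scores in the shape shown above; no such field is known to exist in the OCR response today, and `ALLOWED`, `pick_allowed`, and `decode` are illustrative names, not part of any API.

```python
import string

# Characters that may legally appear in the documents (from the issue text).
ALLOWED = set(string.ascii_letters + string.digits + "/,'<-")


def pick_allowed(candidates, allowed=ALLOWED):
    """Return the highest-scoring candidate that is in `allowed`, or None.

    `candidates` is a hypothetical {character: confidence} mapping for one
    character position, e.g. {"!": 0.7, "1": 0.69, "j": 0.3}.
    """
    legal = {ch: score for ch, score in candidates.items() if ch in allowed}
    if not legal:
        return None
    return max(legal, key=legal.get)


def decode(positions, allowed=ALLOWED, placeholder="?"):
    """Decode a sequence of candidate mappings into a string.

    Positions with no legal candidate are marked with `placeholder`
    so they can be reviewed manually instead of silently guessed.
    """
    chars = []
    for candidates in positions:
        best = pick_allowed(candidates, allowed)
        chars.append(best if best is not None else placeholder)
    return "".join(chars)
```

With the example scores from this request, `pick_allowed({"!": .7, "1": 0.69, "j": 0.3})` selects "1", because "!" is not in the allowed set even though it scored highest.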

Reasons why alternative solutions are not sufficient:

The currently recommended solution is to set an OCR language. This is not sufficient, because OCR still confuses letters with illegal symbols and numbers.
The other solution is post-processing: building a dictionary that replaces every diacritic or look-alike character with the closest legal character. This is not reliable, as shown above. Substitution example: "B"<- "BⒷＢḂḄḆɃƂƁ", "C"<- "CⒸＣĆĈĊČÇḈƇȻꜾ", "S"<- "SⓈＳẞŚṤŜṠŠṦṢṨȘŞⱾŞꞨꞄ$".
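
For reference, a minimal Python sketch of that dictionary-based post-processing, using only the three substitution rows listed above (`SUBSTITUTIONS`, `REVERSE`, and `normalize` are illustrative names). It shows where the approach stops working: a one-to-one look-alike such as "$" can be mapped back, but "!" has no single legal counterpart, so it has to be left as-is.

```python
# Legal character -> look-alike characters OCR may return instead.
# Only the three rows quoted in this issue; a real table would be larger.
SUBSTITUTIONS = {
    "B": "BⒷＢḂḄḆɃƂƁ",
    "C": "CⒸＣĆĈĊČÇḈƇȻꜾ",
    "S": "SⓈＳẞŚṤŜṠŠṦṢṨȘŞⱾŞꞨꞄ$",
}

# Invert the table: look-alike character -> legal character.
REVERSE = {
    variant: base
    for base, variants in SUBSTITUTIONS.items()
    for variant in variants
}


def normalize(text):
    """Replace every known look-alike with its legal character.

    Characters missing from the table (for example "!") pass through
    unchanged, because there is no unambiguous replacement for them.
    """
    return "".join(REVERSE.get(ch, ch) for ch in text)
```

Here `normalize("$ale")` yields "Sale", while `normalize("! P!!!ow")` returns the input unchanged, which is the case this request is about.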

Comments

ro...@google.com <ro...@google.com> #2 Feb 18, 2021 10:53AM

Assigned to gc...@google.com.

I have forwarded this request to the engineering team. We will update this issue with any progress updates and a resolution.

Best Regards,
Josh Moyer
Google Cloud Platform Support

wa...@gmail.com <wa...@gmail.com> #3 Jul 28, 2022 08:07AM

This is not only useful for IP addresses, but also for many other resources. I understand that names are currently used as identifiers, so this request is probably not trivial to implement. Maybe distinguishing between a (numeric, automatically generated) identifier and a (textual) label is the way to go?
