Assigned
Status Update
Comments
ro...@google.com <ro...@google.com> #2
I have forwarded this request to the engineering team. We will update this issue with any progress updates and a resolution.
Best Regards,
Josh Moyer
Google Cloud Platform Support
wa...@gmail.com <wa...@gmail.com> #3
This is not only useful for IP addresses, but also for many other resources. I understand that names are currently used as identifiers, so this request is probably not trivial to implement. Maybe distinguishing between a (numeric, automatically generated) identifier and a (textual) label is the way to go?
Description
Please add a feature that lets us specify the set of characters OCR should look for in a document.
The issue:
I'm working with low-quality documents that can only contain alphanumeric characters [A-Za-z0-9] and a few known symbols [/,'<-]. OCR sometimes misreads these characters as illegal ones, for example:
S becomes one of Š$ⱾŞꞨ...; < becomes one of く^人>ぐ《«众...
This cannot be easily reversed. For example, I know my document doesn't contain "!", but I get the OCR result "! P!!!ow". I can't reliably correct it to "1 Pillow". Here, the confidence score for the first character may have been 0.7 for "!" and 0.69 for "1", so OCR returned "!". If I could specify the characters to look for in the image, it could increase accuracy and reduce the OCR post-processing required.
What you would like to accomplish:
I would like an option to request that OCR only look for characters from a list I provide.
How this might work (2 possible ways):
When submitting an OCR request, specify a character set, just as you specify a language. OCR would only return characters from the set, even if other characters had a higher confidence score.
The other solution would be to return, for each character, all predictions with a score above 0.1, from which we could pick the characters we need. It would be enabled with a flag on the request, just as you specify a language. Example return value: "confidence": {"!": 0.7, "1": 0.69, "j": 0.3}.
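To illustrate the second option, here is a minimal sketch of how a client could use such per-character scores. The `candidates` dictionary mirrors the proposed "confidence" output above; that output does not exist today, and `ALLOWED` and `pick_allowed` are hypothetical names made up for this example.

```python
import string

# Characters the documents may legally contain, per the description:
# alphanumerics plus the known symbols / , ' < -
ALLOWED = set(string.ascii_letters + string.digits + "/,'<-")


def pick_allowed(candidates, allowed=ALLOWED):
    """Return the highest-scoring candidate that is in `allowed`, or None."""
    legal = {ch: score for ch, score in candidates.items() if ch in allowed}
    if not legal:
        return None
    return max(legal, key=legal.get)


# Hypothetical per-character output in the shape proposed in this request.
first_char = {"!": 0.7, "1": 0.69, "j": 0.3}
print(pick_allowed(first_char))  # 1
```

With this data, "!" is discarded because it is not in the allowed set, and "1" wins over "j" on score.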
Reasons why alternative solutions are not sufficient:
The currently recommended solution is setting an OCR language. It is not sufficient because OCR still confuses letters with illegal symbols and numbers.
The other solution is post-processing with a dictionary that replaces each diacritic or look-alike character with the closest legal character. But it is not reliable, as shown above. Substitution example: "B"<- "BⒷBḂḄḆɃƂƁ", "C"<- "CⒸCĆĈĊČÇḈƇȻꜾ", "S"<- "SⓈSẞŚṤŜṠŠṦṢṨȘŞⱾŞꞨꞄ$".
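For reference, the dictionary approach looks roughly like the sketch below. It uses only a small subset of the substitution lists above, and `SUBSTITUTIONS`, `LOOKUP`, and `normalize` are names chosen for this example.

```python
# Map each legal character to a few of the look-alikes OCR may return instead.
SUBSTITUTIONS = {
    "B": "ḂḄɃƁ",
    "C": "ĆČÇƇ",
    "S": "ŠŞȘ$",
}

# Invert into a per-character lookup: look-alike -> legal character.
LOOKUP = {bad: good for good, bads in SUBSTITUTIONS.items() for bad in bads}


def normalize(text):
    """Replace every known look-alike; leave all other characters untouched."""
    return "".join(LOOKUP.get(ch, ch) for ch in text)


print(normalize("ŠCHOOL ḂU$"))  # SCHOOL BUS
print(normalize("! P!!!ow"))    # ! P!!!ow
```

The second call shows the limitation: in "! P!!!ow" the same "!" stands for "1", "i", and "l" in different positions, so no fixed one-to-one table can recover "1 Pillow".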