Hindi PDF scan to text problems [329514366]

Assigned

Feature Request

Status Update

No update yet.

Description

an...@google.com

created issue #1

Mar 13, 2024 11:47PM

What you would like to accomplish:
-Fix issues or improve OCR quality for Document AI for the Hindi language.

Issues Encountered:
-Some characters are replaced.
-Some characters are inserted at line breaks (e.g. hyphens, periods, single/double quotes, or random characters).
-The vertical line (the danda, "।"), which serves the same purpose as a period (full stop) in English, is recognized as the number 1.
-Double quotes are detected as different characters (different bytes).
-Extra spaces are created.
-All dash variants can be detected as the same dash character used in English.
-Some characters are not detected correctly.
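
A few of the issues above (the danda read as "1", double quotes returned as different code points, extra spaces) could be partly masked on the client side while the OCR model itself is unchanged. The following is only a minimal post-processing sketch in Python, not part of Document AI: the function name normalize_hindi_ocr and the rules it applies are hypothetical, and the danda rule is a heuristic that only fires when "1" or "|" is attached directly to a Devanagari character, so it will miss cases where a space precedes the misread mark.

```python
import re

DANDA = "\u0964"  # Devanagari danda, the Hindi full stop

# Map typographic quote variants to their ASCII forms so that the same
# visible quote always comes back as the same code point.
_QUOTE_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"', "\u201e": '"', "\u201f": '"',
    "\u2018": "'", "\u2019": "'",
})

# Heuristic: a "1" or "|" glued to a Devanagari character (U+0900..U+097F)
# and followed by whitespace or end of text is assumed to be a misread danda.
_MISREAD_DANDA = re.compile(r"(?<=[\u0900-\u097F])[1|](?=\s|$)")


def normalize_hindi_ocr(text: str) -> str:
    """Apply hypothetical cleanup rules to OCR output for Hindi text."""
    text = text.translate(_QUOTE_MAP)
    text = _MISREAD_DANDA.sub(DANDA, text)
    # Collapse runs of spaces/tabs and drop trailing spaces before newlines.
    text = re.sub(r"[ \t]{2,}", " ", text)
    text = re.sub(r"[ \t]+\n", "\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = "राम घर गया1 वह  \u201cखुश\u201d था1"
    print(normalize_hindi_ocr(sample))
```

A standalone digit such as the one in "कक्षा 1 में" is left alone because it is preceded by a space rather than a Devanagari character. Rules like these only hide the symptoms, so they fit the short-term workaround category rather than a fix.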

How this might work:
-Hindi characters should be recognized correctly.

If applicable, reasons why alternative solutions are not sufficient:
-The current alternative is only a short-term resolution.

Other information (workarounds you have tried, documentation consulted, etc):
-The workaround applied is to use the builtin/weekly model, as it shows fewer issues/inconsistencies in the results.
