PDF Text recognition does not offer TOP DOWN scanning [154751409]

Assigned

Feature Request

Status Update

No update yet.

Description

so...@icsl.es

created issue #1

Apr 23, 2020 11:56AM

Hello,

PDFs sent to Google Vision are not read in top-down order but left to right, block by block.
That makes it nearly impossible to parse the output properly.

Nearly all other solutions on the market (even some free ones) offer the option of keeping the original layout, even for tables.

This is a much-needed feature for serious PDF-to-text processing.

Thanks.

Comments

ci...@google.com <ci...@google.com> #2 Apr 29, 2020 07:46AM

Assigned to gc...@google.com.

Hello,

It seems that the Vision API output for a PDF file already provides the text from top to bottom, as described in [1].

Could you please include a specific example of your issue (and attach an example PDF that generates the undesired output) so that I can better understand the problem and help you with it?

--------------------
[1] https://cloud.google.com/vision/docs/pdf

so...@icsl.es <so...@icsl.es> #3 Apr 29, 2020 12:59PM

Hello,

Please find the sample enclosed: the PDF, the output JSON, and the .txt I generate by simply iterating over the JSON sequentially.
I know Vision generates the coordinates for each letter, but it should be straightforward to extract the text in the right order based on the positions in the JSON.

Regards,

Josep

ocr_output_test-ocr_output-1-to-1.json (20 KB)

test-ocr.pdf (35 KB)

test-ocr.txt (86 B)
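
For readers who need a workaround in the meantime, the position-based reordering mentioned in the comment above could be sketched roughly as follows. This is only an illustration, not something posted in the thread and not Vision API behaviour: it assumes the output JSON has the usual `responses[].fullTextAnnotation.pages[].blocks[].paragraphs[].words[]` layout with `boundingBox.normalizedVertices` (or `vertices`) on each word. The function names and the `tolerance` parameter are hypothetical, and the row grouping is a simple heuristic that will not cope with rotated or skewed pages.

```python
import json
from statistics import median


def page_words(page):
    """Yield (text, x_min, y_center, height) for every word on a page."""
    for block in page.get("blocks", []):
        for para in block.get("paragraphs", []):
            for word in para.get("words", []):
                box = word.get("boundingBox", {})
                # Assumption: PDF results carry normalizedVertices, images carry vertices.
                verts = box.get("normalizedVertices") or box.get("vertices") or []
                if not verts:
                    continue
                # Coordinates equal to zero may be omitted from the JSON.
                xs = [v.get("x", 0) for v in verts]
                ys = [v.get("y", 0) for v in verts]
                text = "".join(s.get("text", "") for s in word.get("symbols", []))
                yield text, min(xs), (min(ys) + max(ys)) / 2, max(ys) - min(ys)


def top_down_lines(page, tolerance=0.5):
    """Group words into visual rows (top to bottom), each row sorted left to right."""
    words = sorted(page_words(page), key=lambda w: w[2])
    if not words:
        return []
    # Heuristic: words whose vertical centers differ by less than a fraction
    # of the typical word height are treated as being on the same row.
    typical_height = median(w[3] for w in words)
    rows = []
    for w in words:
        if rows and abs(w[2] - rows[-1]["y"]) <= tolerance * typical_height:
            rows[-1]["words"].append(w)
        else:
            rows.append({"y": w[2], "words": [w]})
    return [
        " ".join(text for text, *_ in sorted(row["words"], key=lambda w: w[1]))
        for row in rows
    ]


def lines_from_file(path):
    """Read one Vision output JSON file and return the reordered lines of all pages."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    lines = []
    for response in data.get("responses", []):
        annotation = response.get("fullTextAnnotation", {})
        for page in annotation.get("pages", []):
            lines.extend(top_down_lines(page))
    return lines
```

With the attachment from this comment, usage would look something like `print("\n".join(lines_from_file("ocr_output_test-ocr_output-1-to-1.json")))`. Words in neighbouring blocks that share a vertical position end up on the same output line, which is closer to the table-like layout requested here than the block-by-block order.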

ci...@google.com <ci...@google.com> #4 Apr 29, 2020 02:18PM

Hello,

Thank you for providing this. I was able to reproduce your issue and I will forward all this information to the Product Engineering team. Please bear in mind that there is no ETA for this, but any updates will be posted here. Consider starring this issue to get automatic notifications [1]. Feature requests starred by a higher number of users are more likely to be implemented [2].

--------------------
[1] https://developers.google.com/issue-tracker/guides/subscribe#starring_an_issue
[2] https://cloud.google.com/support/docs/issue-trackers#what_to_expect_once_youve_opened_an_issue

Message last modified on Apr 29, 2020 04:01PM
