Post-HITL Document Naming is Unclear [288853311]

Assigned

Bug

Status Update

No update yet.

Description

th...@foreflight.com

created issue #1

Jun 26, 2023 03:22PM

Problem you have encountered:

Processing a document through our custom document extractor (from a local script) results in a new folder being added to our output directory. This folder has a long name of ~20 digits. It contains a folder [0...n] for each document processed. Each of these folders contains the JSON output from the processor. A sample path would be output-directory/12345678901234567890/0/filename-0.json.
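To make the layout above concrete, here is a minimal sketch that splits a pre-HITL output path into its parts. The names `ProcessorOutput` and `parse_processor_output_path` are hypothetical, and treating the trailing "-0" in the JSON name as a numeric suffix appended to the original filename is an assumption based only on the sample path.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ProcessorOutput(NamedTuple):
    run_folder: str      # the long numeric folder name
    document_index: int  # the [0...n] folder for each processed document
    source_stem: str     # JSON name without the trailing "-<n>" (assumed layout)


def parse_processor_output_path(path: str) -> ProcessorOutput:
    """Split a path like output-directory/<digits>/<index>/<name>-<n>.json."""
    parts = PurePosixPath(path).parts
    if len(parts) < 3:
        raise ValueError(f"expected at least <run>/<index>/<file>, got {path!r}")
    run_folder, index, filename = parts[-3], parts[-2], parts[-1]

    stem = PurePosixPath(filename).stem  # e.g. "filename-0"
    head, sep, suffix = stem.rpartition("-")
    # Only strip the suffix when it looks like a numeric counter (assumption).
    source_stem = head if sep and suffix.isdigit() else stem

    return ProcessorOutput(run_folder, int(index), source_stem)
```

For pre-HITL output this recovers the original filename stem, which is exactly the piece of information that has no visible counterpart in the post-HITL path.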

The process for finding post-Human-in-the-Loop JSONs is much less clear. After processing a document in the same manner as above and letting it trigger human review (either with a known bad document or by setting the document confidence threshold to 100%), our document shows up in the Specialist portal. We adjust the annotations accordingly and press Submit. This appears to create a folder with a ~20-character name in our HITL output directory, which contains a JSON file with a ~20-character name. Both the folder name and the JSON name appear unrelated to the original filename. The JSON also contains a lot of fields not included in the original output JSON, but I'm inclined to believe that may be a separate issue. A sample path here would be hitl/12345678901234567890/98765432109876543210.json.

I can't see how it's possible to match the post-HITL JSON files up to the post-processor pre-HITL JSON files.

What you expected to happen:

I expect the post-HITL JSON files to have a filename (or other metadata) that can be used to match these files to the pre-HITL files. Otherwise I can't see how HITL is a useful feature.

Steps to reproduce:

Create a custom document extractor
Provide training data and train it accordingly
Set up HITL
Use either the HITL GUI or the command line to upload a document
Review that document through the specialist portal
Wait for the document to appear in your HITL output folder.

Other information (workarounds you have tried, documentation consulted, etc):

We've tried searching the contents of the post-HITL JSON files for identifying information -- this is unreliable, especially if documents have similar names.
Are the names of these folders and JSONs random? It's unclear.
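The content search mentioned above looks roughly like the following sketch. It assumes the post-HITL JSON files have been downloaded to a local directory and simply checks whether the original filename occurs anywhere in each file. The function name `find_hitl_candidates` is hypothetical, and nothing here relies on a particular field of the JSON.

```python
from pathlib import Path


def find_hitl_candidates(hitl_root: str, original_name: str) -> list[Path]:
    """Return post-HITL JSON files whose raw contents mention original_name."""
    matches = []
    # Walk every <folder>/<name>.json under the local copy of the HITL output.
    for json_path in sorted(Path(hitl_root).rglob("*.json")):
        text = json_path.read_text(encoding="utf-8", errors="replace")
        # Plain substring match: similar filenames can all match here.
        if original_name in text:
            matches.append(json_path)
    return matches
```

Because this is a plain substring match, a search for one document name can also return files belonging to documents with similar names, which is the unreliability described above.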

I previously opened a ticket here: https://issuetracker.google.com/issues/287924956. I wasn't able to respond in time, but I am happy to provide additional information in a private thread.

Comments

ds...@google.com <ds...@google.com> Jun 27, 2023 05:48AM

Assigned to ds...@google.com.

ds...@google.com <ds...@google.com> #2 Jun 28, 2023 06:03AM

Hello,

To assist us in conducting a thorough investigation, we kindly request your cooperation in providing the following information regarding the reported issue:

Has this scenario ever worked as expected in the past?
Do you see this issue constantly or intermittently?
If this issue is seen intermittently, how often do you observe it? Is there any specific scenario or time at which it is observed?
To help us understand the issue better, please provide detailed steps to reliably reproduce the problem.
It would be greatly helpful if you could attach screenshots of the output related to this issue.

Your cooperation in providing these details will enable us to dive deeper into the matter and work towards a prompt resolution. We appreciate your assistance and look forward to resolving this issue for you.

Thank you for your understanding and cooperation.

ds...@google.com <ds...@google.com> Aug 31, 2023 04:43PM

Reassigned to gc...@google.com.
