Assigned
Status Update
Comments
se...@google.com <se...@google.com> #2
I have informed our engineering team of this feature request. There is currently no ETA for its implementation.
A current workaround is to check the "vertices" of the "boundingPoly" [1] returned for each "textAnnotations" entry. If the calculated rectangle's height is greater than its width, then your image is sideways.
[1]https://cloud.google.com/vision/reference/rest/v1/images/annotate#boundingpoly
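The suggested workaround can be sketched like this; looksSideways is an illustrative name (not part of the API) and the vertices shape follows the boundingPoly reference in [1]:

```javascript
// Sketch of the suggested workaround: compare the bounding box's
// height to its width for a returned textAnnotation.
// looksSideways is an illustrative name, not part of the Vision API.
function looksSideways(textAnnotation) {
  const vertices = textAnnotation.boundingPoly.vertices;
  // Missing x/y fields default to 0 in the API's JSON representation.
  const xs = vertices.map(v => v.x || 0);
  const ys = vertices.map(v => v.y || 0);
  const width = Math.max(...xs) - Math.min(...xs);
  const height = Math.max(...ys) - Math.min(...ys);
  return height > width; // taller than wide => likely rotated
}
```

Note that, as comment #7 below points out, this only reflects the shape of the detected text block, not the image itself.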
[Deleted User] <[Deleted User]> #3
I also need this problem solved :)
[Deleted User] <[Deleted User]> #4
same :D
[Deleted User] <[Deleted User]> #6
+1
lu...@gmail.com <lu...@gmail.com> #7
This needs more attention. It's not just a display issue as described in the report. The coordinates returned in 'boundingPoly' are incorrect if the image was taken on a phone: all the x values should be y values and vice versa.
The workaround does not make sense as "boundingPoly" [1] "vertices" for "textAnnotations" does not indicate the image dimensions - it indicates the dimensions of the relevant text block inside the image.
Description
1 Speaker diarization results from GCP for videos with multiple speakers (any number between 2 and 15) always report only 2 or 3 speakers.
2 Even though only 2 speakers are predicted, GCP returns speaker tags 1 and 3 (instead of the expected tags 1 and 2).
Steps to Reproduce:
1 Take any video (preferably a Zoom recording) with multiple speakers (preferably between 4 and 10) and convert it into a wav file using the FFmpeg library with the attributes below:
16 kHz sample rate
pcm_s16le audio codec
output audio channels = 1, with -vn (no video) for the output
-map_metadata -1
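The conversion in step 1 can be expressed as a single FFmpeg command; input.mp4 and output.wav are placeholder filenames:

```shell
# Sketch of the conversion described above (placeholder filenames):
# -vn drops video, -acodec pcm_s16le sets the codec, -ar 16000 the
# sample rate, -ac 1 mono output, -map_metadata -1 strips metadata.
ffmpeg -i input.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 -map_metadata -1 output.wav
```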
2 That wav file is then passed to longRunningRecognize in GCP's Speech-to-Text v1p1beta1 library with the config below (namespaces normalized to v1p1beta1, which is the library actually called):

const encoding = google.cloud.speech.v1p1beta1.RecognitionConfig.AudioEncoding.LINEAR16;
const languageCode = 'en-US';
const sampleRateHertz = 16000;

enableSpeakerDiarization: true,
enableWordTimeOffsets: true,
enableAutomaticPunctuation: true,
maxAlternatives: 1,
profanityFilter: true,
model: 'video',
metadata: {
  originalMediaType: google.cloud.speech.v1p1beta1.RecognitionMetadata.OriginalMediaType.VIDEO,
  interactionType: google.cloud.speech.v1p1beta1.RecognitionMetadata.InteractionType.PHONE_CALL,
},
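Wired together, the call in step 2 looks roughly like the sketch below. It assumes the @google-cloud/speech Node.js client; buildRequest and transcribe are illustrative names, not part of the report, and the JSON enum strings ('LINEAR16', 'VIDEO', 'PHONE_CALL') are the usual request-object equivalents of the constants above:

```javascript
// Illustrative request builder matching the config reported above.
function buildRequest(audioBase64) {
  return {
    audio: { content: audioBase64 },
    config: {
      encoding: 'LINEAR16',
      sampleRateHertz: 16000,
      languageCode: 'en-US',
      enableSpeakerDiarization: true,
      enableWordTimeOffsets: true,
      enableAutomaticPunctuation: true,
      maxAlternatives: 1,
      profanityFilter: true,
      model: 'video',
      metadata: { originalMediaType: 'VIDEO', interactionType: 'PHONE_CALL' },
    },
  };
}

// Assumed usage with the v1p1beta1 client (requires GCP credentials).
async function transcribe(wavBuffer) {
  // require() here so buildRequest stays usable without the client installed
  const speech = require('@google-cloud/speech').v1p1beta1;
  const client = new speech.SpeechClient();
  const [operation] = await client.longRunningRecognize(
    buildRequest(wavBuffer.toString('base64')),
  );
  const [response] = await operation.promise();
  return response;
}
```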
3 Expected Behaviour:
1 Speaker diarization is expected to accurately identify the number of speakers and their corresponding utterances, not report 2 or 3 speakers regardless of the actual count.
2 Speaker tags are supposed to be consecutive numbers from 1 to N, where N = the predicted number of speakers, not a predicted count of 2 speakers with tags 1 and 3.
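The tag discontinuity can be checked by collecting the distinct speakerTag values from the response words. distinctSpeakerTags is an illustrative helper, and the response used in the usage example is a hand-built stand-in shaped like the v1p1beta1 output (word-level tags sit on the result alternatives' words):

```javascript
// Illustrative helper: collect the distinct speakerTag values from a
// longRunningRecognize response. Helper name and response shape are
// assumptions based on the v1p1beta1 response structure.
function distinctSpeakerTags(response) {
  const tags = new Set();
  for (const result of response.results) {
    for (const word of result.alternatives[0].words || []) {
      if (word.speakerTag) tags.add(word.speakerTag);
    }
  }
  return [...tags].sort((a, b) => a - b);
}
```

With the behaviour reported above, this returns [1, 3] for a "2-speaker" result, where [1, 2] is expected.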