Status Update
Comments
su...@google.com <su...@google.com> #2
We have shared this with our product and engineering team and will update this issue with more information as it becomes available.
da...@google.com <da...@google.com> #3
Thanks for the report. We apparently misread that part of the spec when we wrote this. Embarrassingly, we do have a test for this; the test is just also wrong (and if I run the test against glibc, it rightly fails).
The fix for this specific problem is quite straightforward, but fixing that actually uncovers some other UTF-16 test failures. I need to spend some more time poking at them to figure out if that's a test bug (glibc is failing them, so I think they are) or if my fix is wrong.
Just skimming the C spec, I think we're also missing a few tests (at the very least, the "does not consume from input" part of the -3 return case).
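For illustration, a test for the "does not consume from input" part might look roughly like the sketch below. This is not taken from the bionic test suite: the helper name `check_minus3_consumes_nothing` is hypothetical, and the locale names it tries are an assumption about what the test machine has installed.

```c
#include <locale.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical test helper (not from the bionic test suite).
   Returns 0 if the libc behaves as the C standard describes,
   1 if it does not, and -1 if no UTF-8 locale could be selected. */
static int check_minus3_consumes_nothing(void)
{
    /* Assumption: at least one of these locale names exists here. */
    if (setlocale(LC_CTYPE, "C.UTF-8") == NULL &&
        setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
        return -1;

    /* U+1F60B followed by an ASCII 'A'. */
    const char input[] = "\360\237\230\213" "A";
    mbstate_t state;
    char16_t wc = 0;
    memset(&state, 0, sizeof state);

    /* First call: consumes the four bytes, stores the high surrogate. */
    if (mbrtoc16(&wc, input, 5, &state) != 4 || wc != 0xD83D)
        return 1;

    /* Second call: must store the low surrogate and return (size_t)-3
       while leaving the 'A' at input + 4 untouched. */
    if (mbrtoc16(&wc, input + 4, 1, &state) != (size_t)-3 || wc != 0xDE0B)
        return 1;

    /* Third call over the same byte: 'A' must still be there to decode. */
    if (mbrtoc16(&wc, input + 4, 1, &state) != 1 || wc != 'A')
        return 1;

    return 0;
}
```

The third call is the interesting one: if the -3 call had swallowed the 'A', that byte could no longer be decoded from the same position.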
br...@gmail.com <br...@gmail.com> #4
It's a good idea to also test on a musl libc system (e.g. Alpine Linux), since musl libc does not have known bugs in this area.
en...@google.com <en...@google.com> #5
> It's a good idea to also test on a musl libc system (e.g. Alpine Linux), since musl libc does not have known bugs in this area.
are you implying "but glibc does"? how about macOS (since that's (a) usually easier to test and (b) more pertinent for "bug compatibility" for app developers [via iOS])?
da...@google.com <da...@google.com> #6
> The fix for this specific problem is quite straightforward, but fixing that actually uncovers some other UTF-16 test failures.
This turned out to be a bug in my fix, and we had a (correct) test for that bug :)
I cleaned up all our uchar.h tests, so they now all pass on bionic, glibc, and musl.
> It's a good idea to also test on a musl libc system (e.g. Alpine Linux), since musl libc does not have known bugs in this area.
> are you implying "but glibc does"? how about macOS (since that's (a) usually easier to test and (b) more pertinent for "bug compatibility" for app developers [via iOS])?
I'm also curious about this, because aside from not being up to date on UTF-8 (glibc still accepts 5- and 6-byte sequences, which I suspect was a conscious decision), our tests found no bugs in glibc. If glibc has known bugs, we need more tests. That's not surprising though. At a glance it did look like someone could write a handful of missed tests if they sat down with the spec for a day or two.
Description
"If the function determines that the next multibyte character is complete and valid, it determines the values of the corresponding wide characters and then, if pc16 is not a null pointer, stores the value of the first (or only) such character in the object pointed to by pc16. Subsequent calls will store successive wide characters without consuming any additional input until all the characters have been stored."
A small positive return value indicates that "the next n or fewer bytes complete a valid multibyte character (which is the value stored); the value returned is the number of bytes that complete the multibyte character."
The return value (size_t)(-3) indicates that "the next character resulting from a previous call has been stored (no bytes from the input have been consumed by this call)".
So, when the input is a Unicode character outside the BMP, the first mbrtoc16() call should return the number of bytes (= 4), whereas the second mbrtoc16() call should return (size_t)(-3).
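As a quick check on the values that show up in the expected output, here is a minimal sketch of the surrogate-pair arithmetic for a code point above U+FFFF. The helper names are hypothetical and are not part of any libc.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of any libc: split a code point in the
   range U+10000..U+10FFFF into its UTF-16 surrogate pair. */
static uint16_t high_surrogate(uint32_t cp)
{
    /* Upper 10 bits of (cp - 0x10000), offset into 0xD800..0xDBFF. */
    return (uint16_t)(0xD800 + ((cp - 0x10000) >> 10));
}

static uint16_t low_surrogate(uint32_t cp)
{
    /* Lower 10 bits of (cp - 0x10000), offset into 0xDC00..0xDFFF. */
    return (uint16_t)(0xDC00 + ((cp - 0x10000) & 0x3FF));
}
```

For U+1F60B this gives 0xD83D and 0xDE0B, which are the two char16_t values the two calls are supposed to store, in that order.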
Test case:
========================= foo.c =========================
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <uchar.h>
int main ()
{
  if (setlocale (LC_ALL, "en_US.UTF-8") == NULL)
    return 1;

  char input[] = "\360\237\230\213"; /* U+1F60B */
  mbstate_t state;
  char16_t wc;
  size_t ret;

  memset (&state, '\0', sizeof (mbstate_t));

  wc = (char16_t) 0xBADF;
  ret = mbrtoc16 (&wc, input, 4, &state);
  printf ("First mbrtoc16 call: wc = 0x%04X, ret = %d\n", wc, (int) ret);

  ret = mbrtoc16 (&wc, input + 4, 0, &state);
  printf ("Second mbrtoc16 call: wc = 0x%04X, ret = %d\n", wc, (int) ret);

  return 0;
}
========================================================
Compile and run this program (e.g. under termux).
Expected output (seen e.g. in musl libc, which is generally very standards compliant, and GNU libc):
First mbrtoc16 call: wc = 0xD83D, ret = 4
Second mbrtoc16 call: wc = 0xDE0B, ret = -3
Actual output (on Android 11):
First mbrtoc16 call: wc = 0xD83D, ret = -3
Second mbrtoc16 call: wc = 0xDE0B, ret = 4
Returning (size_t)(-3) in the first call is wrong, because that is not "the next character from a previous call".
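To make the expected call sequence concrete without depending on any particular libc or locale, here is a toy model of the behavior described above. It is only a sketch: `toy_mbrtoc16` and `struct toy_state` are hypothetical names, it assumes UTF-8 input and a non-null `pc16`, and it skips overlong, surrogate, and out-of-range checks as well as buffering of partial input. It is not how bionic, glibc, or musl implement the function.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state for this sketch; a real libc hides this in mbstate_t. */
struct toy_state {
    uint16_t pending_low;  /* low surrogate still to be delivered, 0 if none */
};

static size_t toy_mbrtoc16(uint16_t *pc16, const unsigned char *s, size_t n,
                           struct toy_state *ps)
{
    uint32_t cp;
    size_t len, i;

    /* A surrogate left over from the previous call goes out first,
       and no input is consumed: that is the (size_t)-3 case. */
    if (ps->pending_low != 0) {
        *pc16 = ps->pending_low;
        ps->pending_low = 0;
        return (size_t)-3;
    }

    if (n == 0)
        return (size_t)-2;
    if (s[0] < 0x80)                { cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; }
    else
        return (size_t)-1;          /* includes 5- and 6-byte lead bytes */
    if (n < len)
        return (size_t)-2;          /* sketch: partial input is not buffered */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return (size_t)-1;      /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    if (cp < 0x10000) {
        *pc16 = (uint16_t)cp;
        return cp == 0 ? 0 : len;   /* the null character returns 0 */
    }

    /* Outside the BMP: store the high surrogate now, return the byte
       count, and keep the low surrogate for the next call. */
    cp -= 0x10000;
    *pc16 = (uint16_t)(0xD800 + (cp >> 10));
    ps->pending_low = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return len;
}
```

Fed the four bytes of U+1F60B, the first call returns 4 with 0xD83D stored, and a second call with zero bytes of input returns (size_t)-3 with 0xDE0B stored, matching the expected output in the report.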