Native crash when decoding XML entities representing four-byte UTF-8 characters [37013126]

Fixed

Bug

[AOSP] duplicate

[AOSP] FutureRelease

Description

ch...@orr.me.uk

created issue #1

Dec 1, 2014 10:49AM

When XML character entities in an Android XML file (e.g. string definitions, or an "android:text" value) decode to a four-byte UTF-8 representation, the virtual machine crashes with the error "input is not valid Modified UTF-8".

I first observed this with emoji (e.g. the character entity for 😃, U+1F603, a smiling face), but any Unicode code point >= 0x10000 seems to cause the crash. These code points are encoded as four-byte UTF-8 sequences, whose first byte is always between 0xf0 and 0xf4, and that start byte is what the NewStringUTF function rejects.

Testing with a high three-byte UTF-8 character, e.g. 0xfffd (�) works fine.
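To make the difference concrete, here is a small Java sketch that prints the standard UTF-8 bytes and the Modified UTF-8 bytes for the two characters mentioned above. The class and method names are made up for illustration. The Modified UTF-8 bytes come from DataOutputStream.writeUTF, which is documented to write that encoding after a two-byte length prefix.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
class Utf8Forms {
    // Formats bytes as lowercase hex separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Standard UTF-8: one four-byte sequence per non-BMP code point.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8: each UTF-16 code unit is encoded on its own, so a
    // surrogate pair becomes two three-byte sequences.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length); // drop the length prefix
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String smiley = "\uD83D\uDE03"; // U+1F603
        System.out.println(hex(standardUtf8(smiley))); // f0 9f 98 83
        System.out.println(hex(modifiedUtf8(smiley))); // ed a0 bd ed b8 83
        System.out.println(hex(standardUtf8("\uFFFD"))); // ef bf bd
        System.out.println(hex(modifiedUtf8("\uFFFD"))); // ef bf bd
    }
}
```

The three-byte character is identical in both encodings, which matches the observation that 0xfffd works while anything at or above 0x10000 does not.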

Example of strings.xml content which crashes:
<string-array name="word_list_good">
    <item>😃</item>
</string-array>

Dalvik stacktrace (on an Android Wear 4.4W2 emulator):

W/dalvikvm( 1934): JNI WARNING: NewStringUTF input is not valid Modified UTF-8: illegal start byte 0xf0
W/dalvikvm( 1934): string: '😃'
W/dalvikvm( 1934): in Landroid/content/res/AssetManager;.getArrayStringResource:(I)[Ljava/lang/String; (NewStringUTF)
I/dalvikvm( 1934): "main" prio=5 tid=1 NATIVE
I/dalvikvm( 1934): | group="main" sCount=0 dsCount=0 obj=0xb2ddcda0 self=0xb8db8480
I/dalvikvm( 1934): | sysTid=1934 nice=0 sched=0/0 cgrp=[fopen-error:2] handle=-1216638336
I/dalvikvm( 1934): | state=R schedstat=( 0 0 0 ) utm=0 stm=0 core=0
I/dalvikvm( 1934): #00 pc 000019e5 /system/lib/libcorkscrew.so (unwind_backtrace+101)
I/dalvikvm( 1934): #01 pc 00008131 /system/lib/libbacktrace.so (CorkscrewCurrent::Unwind(unsigned int)+49)
I/dalvikvm( 1934): #02 pc 000028c9 /system/lib/libbacktrace.so (Backtrace::Unwind(unsigned int)+25)
I/dalvikvm( 1934): #03 pc 000b7c61 /system/lib/libdvm.so (dvmDumpNativeStack(DebugOutputTarget const*, int)+81)
I/dalvikvm( 1934): #04 pc 000954a8 /system/lib/libdvm.so (dvmDumpThreadEx(DebugOutputTarget const*, Thread*, bool)+1512)
I/dalvikvm( 1934): #05 pc 0009568b /system/lib/libdvm.so (dvmDumpThread(Thread*, bool)+75)
I/dalvikvm( 1934): #06 pc 0004beb3 /system/lib/libdvm.so
I/dalvikvm( 1934): #07 pc 0004dcdd /system/lib/libdvm.so (ScopedCheck::check(bool, char const*, ...)+1853)
I/dalvikvm( 1934): #08 pc 0005269a /system/lib/libdvm.so
I/dalvikvm( 1934): #09 pc 000b7fec /system/lib/libandroid_runtime.so
I/dalvikvm( 1934): #10 pc 0002b3de /system/lib/libdvm.so (dvmPlatformInvoke+82)
I/dalvikvm( 1934): at android.content.res.AssetManager.getArrayStringResource(Native Method)
I/dalvikvm( 1934): at android.content.res.AssetManager.getResourceStringArray(AssetManager.java:186)
I/dalvikvm( 1934): at android.content.res.Resources.getStringArray(Resources.java:468)

Or in a layout XML file:
<TextView
android:id="@+id/happy"
android:layout_width="wrap_content"
android:layout_height="wrap_content"
android:text="😃" />

ART stacktrace (on a Nexus 5 with Android 5.0):
art/runtime/check_jni.cc:65] JNI DETECTED ERROR IN APPLICATION: input is not valid Modified UTF-8: illegal start byte 0xf0
art/runtime/check_jni.cc:65] string: '😃'
art/runtime/check_jni.cc:65] in call to NewStringUTF
art/runtime/check_jni.cc:65] from java.lang.String android.content.res.StringBlock.nativeGetString(long, int)
art/runtime/check_jni.cc:65] "main" prio=5 tid=1 Runnable
art/runtime/check_jni.cc:65] | group="main" sCount=0 dsCount=0 obj=0x737fdec0 self=0xb4f07800
art/runtime/check_jni.cc:65] | sysTid=2892 nice=0 cgrp=apps sched=0/0 handle=0xb6f12ec8
art/runtime/check_jni.cc:65] | state=R schedstat=( 571393690 122086422 592 ) utm=50 stm=7 core=1 HZ=100
art/runtime/check_jni.cc:65] | stack=0xbe0d2000-0xbe0d4000 stackSize=8MB
art/runtime/check_jni.cc:65] | held mutexes= "mutator lock"(shared held)
art/runtime/check_jni.cc:65] native: #00 pc 00004c58 /system/lib/libbacktrace_libc++.so (UnwindCurrent::Unwind(unsigned int, ucontext*)+23)
art/runtime/check_jni.cc:65] native: #01 pc 000034c1 /system/lib/libbacktrace_libc++.so (Backtrace::Unwind(unsigned int, ucontext*)+8)
art/runtime/check_jni.cc:65] native: #02 pc 0025918d /system/lib/libart.so (art::DumpNativeStack(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, int, char const*, art::mirror::ArtMethod*)+84)
art/runtime/check_jni.cc:65] native: #03 pc 0023cd13 /system/lib/libart.so (art::Thread::Dump(std::__1::basic_ostream<char, std::__1::char_traits<char> >&) const+162)
art/runtime/check_jni.cc:65] native: #04 pc 000b1195 /system/lib/libart.so (art::JniAbort(char const*, char const*)+620)
art/runtime/check_jni.cc:65] native: #05 pc 000b18c5 /system/lib/libart.so (art::JniAbortF(char const*, char const*, ...)+68)
art/runtime/check_jni.cc:65] native: #06 pc 000b3e63 /system/lib/libart.so (art::ScopedCheck::Check(bool, char const*, ...) (.constprop.128)+922)
art/runtime/check_jni.cc:65] native: #07 pc 000bd965 /system/lib/libart.so (art::CheckJNI::NewStringUTF(_JNIEnv*, char const*)+44)
art/runtime/check_jni.cc:65] native: #08 pc 00087f97 /system/lib/libandroid_runtime.so (???)
art/runtime/check_jni.cc:65] native: #09 pc 002599a7 /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_content_res_StringBlock_nativeGetString__JI+102)
art/runtime/check_jni.cc:65] at android.content.res.StringBlock.nativeGetString(Native method)
art/runtime/check_jni.cc:65] at android.content.res.StringBlock.get(StringBlock.java:82)
art/runtime/check_jni.cc:65] - locked <0x125b223b> (a android.content.res.StringBlock)
art/runtime/check_jni.cc:65] at android.content.res.XmlBlock$Parser.getPooledString(XmlBlock.java:458)
art/runtime/check_jni.cc:65] at android.content.res.TypedArray.loadStringValueAt(TypedArray.java:967)
art/runtime/check_jni.cc:65] at android.content.res.TypedArray.getText(TypedArray.java:144)
art/runtime/check_jni.cc:65] at android.widget.TextView.<init>(TextView.java:917)

Comments

na...@google.com <na...@google.com> Dec 1, 2014 10:55AM

Assigned to na...@google.com.

en...@google.com <en...@google.com> #2 Dec 2, 2014 04:10AM

to expand slightly, i think there are two choices:

1. change the VM to not require surrogates.

2. change all code calling NewStringUTF that might need to deal with emoji to convert to surrogates first.

this wasn't a problem in the past because non-BMP just wasn't relevant, but emoji have really changed that. the older i get, the more i wonder whether we should change the VM.

that said, as we see with how easy it is to crash Settings by switching to Arabic, most code that uses NewStringUTF is pretty suspect to start with. an audit/rewrite would probably be a good thing. i'm just not sure how practical that is. (and i never did go back and fix that Settings crash...)
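For code that cannot wait for a VM change, one possible route is to avoid NewStringUTF for such text altogether: pass the raw UTF-8 bytes up to Java and let the standard decoder produce the UTF-16 string, surrogate pairs included. This is a variation on option 2, not the conversion in native code that the comment describes. The sketch below shows only the Java side; the helper name fromNativeUtf8 is hypothetical, and it assumes the native caller delivers the text as a byte array.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical Java-side helper; the native caller is assumed to pass
// standard UTF-8 bytes as a byte[] instead of calling NewStringUTF.
class NativeStrings {
    static String fromNativeUtf8(byte[] utf8) {
        // The standard decoder accepts four-byte sequences and stores each
        // non-BMP code point as a UTF-16 surrogate pair.
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] smiley = {(byte) 0xf0, (byte) 0x9f, (byte) 0x98, (byte) 0x83};
        String s = fromNativeUtf8(smiley);
        System.out.println(s.length());                          // 2 (one surrogate pair)
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f603
    }
}
```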

na...@google.com <na...@google.com> #3 Dec 2, 2014 11:52AM

To expand on 2 slightly, the VM is right that the sequence isn't modified UTF-8. These code points are supposed to be encoded as a surrogate pair, with each surrogate written as its own 3-byte sequence (2 x 3 bytes).

That said, I was thinking of modifying the VM to accept 4 byte utf-8 sequences and convert them into utf-16 surrogate pairs. It's bound to be tricky but it will probably make life a lot easier for apps that are treating mutf-8 as "null terminated UTF-8 over UCS-2 - {0}". We'll have to be stricter about overly long encodings though.
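As a rough illustration of the idea in this comment (not the actual VM change, which is native code), a lenient decoder can accept both the six-byte surrogate form and the four-byte form, turning the latter into a surrogate pair. The Java sketch below uses made-up names and rejects overlong or out-of-range four-byte sequences. It makes no attempt to validate everything a real implementation would need to.

```java
// Illustrative sketch only; class and method names are hypothetical.
class LenientModifiedUtf8 {
    static String decode(byte[] in) {
        StringBuilder out = new StringBuilder(in.length);
        int i = 0;
        while (i < in.length) {
            int b0 = in[i] & 0xff;
            if (b0 < 0x80) {                       // one byte (ASCII)
                out.append((char) b0);
                i += 1;
            } else if ((b0 & 0xe0) == 0xc0) {      // two bytes, including c0 80 for U+0000
                out.append((char) (((b0 & 0x1f) << 6) | cont(in, i + 1)));
                i += 2;
            } else if ((b0 & 0xf0) == 0xe0) {      // three bytes, including lone surrogates
                out.append((char) (((b0 & 0x0f) << 12)
                        | (cont(in, i + 1) << 6) | cont(in, i + 2)));
                i += 3;
            } else if ((b0 & 0xf8) == 0xf0) {      // four bytes: the new, lenient case
                int cp = ((b0 & 0x07) << 18) | (cont(in, i + 1) << 12)
                        | (cont(in, i + 2) << 6) | cont(in, i + 3);
                if (cp < 0x10000 || cp > 0x10ffff) {
                    throw new IllegalArgumentException("overlong or out-of-range 4-byte sequence");
                }
                out.append(Character.highSurrogate(cp)).append(Character.lowSurrogate(cp));
                i += 4;
            } else {
                throw new IllegalArgumentException(
                        "illegal start byte 0x" + Integer.toHexString(b0));
            }
        }
        return out.toString();
    }

    // Returns the six payload bits of a continuation byte, or throws.
    private static int cont(byte[] in, int idx) {
        if (idx >= in.length || (in[idx] & 0xc0) != 0x80) {
            throw new IllegalArgumentException("missing or invalid continuation byte");
        }
        return in[idx] & 0x3f;
    }
}
```

With this sketch, the bytes f0 9f 98 83 and ed a0 bd ed b8 83 both decode to the same two-char string for U+1F603.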

My only worry is that we're introducing yet another pseudo-encoding :( . If we do this, we'll have to go delete the line on the wiki UTF-8 article that says "All known Modified UTF-8 implementations also treat the surrogate pairs as in CESU-8"

en...@google.com <en...@google.com> #4 Dec 2, 2014 05:41PM

sgtm

na...@google.com <na...@google.com> #8 Apr 8, 2015 02:32PM

Marked as fixed.

https://android-review.googlesource.com/130121

mr...@gmail.com <mr...@gmail.com> #9 Feb 13, 2016 03:17PM

A change the VM could use, related to this, is a char32 type, or uchar (for UCS4 char) as an all alpha label. Ustring would be the complement to String. Then if the source is char16, the 6-byte surrogates get used going to UTF8, and with char32 the 4-byte form is output for non-BMP code points. Going the other way, utf8 to char16 or char32, whether it's 4-byte UTF8 or a 6-byte pair, the target width determines whether UTF16 or UTF32 is used. An invalid pair gets stored as separate char32 values, not converted, or optionally throws an exception. I doubt adding bytecodes is even necessary, just a type letter to the class record, as the int ops can be overloaded. The changes to the compiler to add the new type keywords should be similarly minimal.

It's a thought, anyways.

ph...@gmail.com <ph...@gmail.com> #10 Mar 11, 2016 09:15AM

You can refer to this solution (i have tried it and it works as expected):

https://github.com/tnt-medallia/android-database-sqlcipher/commit/2c46889df16a2b7478076f8719dfe6929fdfa9b6

ra...@gmail.com <ra...@gmail.com> #11 Jun 11, 2016 07:25AM

how to resolve this error

[Deleted User] <[Deleted User]> #12 Aug 26, 2016 01:30AM

The workaround (for older phones) appears to be not putting the Emoji into your XML, but instead defining it in code.

So you can do something like this:
<string name="hooray">Hooray! %1$s</string>

Then in code:
final String PACKAGE_EMOJI = "\uD83C\uDF81";

getString(R.string.hooray, PACKAGE_EMOJI);
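If writing surrogate escapes by hand is error-prone, the same string can be built from the code point. This is a minimal sketch with hypothetical names. String.format stands in for getString(R.string.hooray, ...) here because Android resources are not available outside an app, so treat that part as an assumption about how the placeholder gets filled.

```java
// Hypothetical helper class, for illustration only.
class EmojiInCode {
    // Builds a String from a Unicode code point; non-BMP values become a
    // surrogate pair automatically.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Stand-in for getString(R.string.hooray, emoji) with the same format string.
    static String hooray(String emoji) {
        return String.format("Hooray! %1$s", emoji);
    }

    public static void main(String[] args) {
        String packageEmoji = fromCodePoint(0x1F381); // same value as "\uD83C\uDF81"
        System.out.println(hooray(packageEmoji));
    }
}
```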

[Deleted User] <[Deleted User]> #13 Mar 20, 2018 08:19AM

When J2V8 calls NewStringUTF with a string which contains emoji, very bad things happen on some versions of Android:

On KitKat 4.4 and Lollipop 5.1.1, the converted string contains garbage characters rather than the original emoji.
On Lollipop 5.0.2, it crashes. This is not a normal exception which can be caught in Java; it actually kills the VM. (See error log below.)
On Marshmallow it appears to work fine.

Re #12: it looks like the only solution is to define the emoji in code.