#2491 UTF-8 Encoding

SlimerDude Tue 10 Nov 2015

Whilst writing the UTF-8 percent escaper in the last post, I was unfortunate enough to peek into the brain-numbing nebulous headache that is character encoding. Still, I was conscious enough to spot this...

Going by the normative definitions of UTF-8 in RFC 3629 and Wikipedia, it would seem that UTF-8 may encode code points in the range 0x000000 -> 0x10FFFF.
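
To put numbers on that, here's how those ranges map to encoded byte counts (a minimal Java sketch; the utf8Length helper is just a made-up name for illustration):

// Number of bytes UTF-8 uses to encode a given Unicode code point,
// per RFC 3629. (Surrogates 0xD800 -> 0xDFFF are not legal in UTF-8;
// that check is omitted here for brevity.)
static int utf8Length(int cp)
{
  if (cp < 0)         throw new IllegalArgumentException("Negative code point");
  if (cp <= 0x7F)     return 1;  // 0xxxxxxx
  if (cp <= 0x7FF)    return 2;  // 110xxxxx 10xxxxxx
  if (cp <= 0xFFFF)   return 3;  // 1110xxxx 10xxxxxx 10xxxxxx
  if (cp <= 0x10FFFF) return 4;  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  throw new IllegalArgumentException("Beyond the Unicode range");
}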

Whereas Fantom's Java code base seems to recognise only code points in the range 0x0000 -> 0xFFFF, as seen in:

  • fan.sys.Charset.Utf8Encoder
  • fan.sys.Charset.Utf8Decoder
  • fan.sys.Uri.percentEncodeChar()

I was just wondering why that is?

I also noted that fan.sys.FanInt.toChar() only recognises Unicode chars in the range 0x0000 -> 0xFFFF.

KevinKelley Tue 10 Nov 2015

I wondered about that before as well; I think it's because Java's internal representation of a char is 2 bytes (UTF-16, enough for the Basic Multilingual Plane), likely because at the time 65K chars seemed like a lot. But Unicode kept growing...

Anyway, since a Java string can only hold 16-bit chars, the upper planes can't be directly decoded into an array-of-char. Probably the decoder should be throwing "unicode is hard and java is weak!" exceptions there... maybe better would be to not fail or throw, but to substitute the Unicode replacement character U+FFFD (�) instead...
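
The mismatch is easy to see with nothing but standard Java APIs (a quick sketch, not Fantom code):

// A supplementary code point needs two Java chars - a surrogate pair:
int grin = 0x1F600;                          // an emoji outside the BMP
char[] units = Character.toChars(grin);      // { 0xD83D, 0xDE00 }
String s = new String(units);
System.out.println(s.length());              // 2 - UTF-16 code units
System.out.println(s.codePointCount(0, 2));  // 1 - actual code points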

Java Character class javadoc

brian Tue 10 Nov 2015

Yeah, it's really a Java thing. They originally didn't support anything but 16-bit chars in strings, and that is still how Java works best. Java does support Unicode values above 0xFFFF using something called supplementary characters / surrogates. It's complicated, and I haven't ever really taken the time to understand the performance of using some of these methods for Fantom. It used to be that there wasn't any practical reason for supporting the higher Unicode planes. But emoticons are higher than 0xFFFF and are probably the chars that we will want to support at some point.

The good news is that from the start, all character representation in Fantom has used a 64-bit integer (long in Java), so we should be future-proof in the APIs. It's really just an internal detail how we map String/char support in the Java runtime.
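
For what it's worth, Java can already surface full code points as plain ints, which map straight onto Fantom's 64-bit Int (a sketch using the standard Java 8 codePoints() stream, not anything Fantom currently does):

// One int per Unicode code point, with surrogate pairs combined:
"a\uD83D\uDE00b".codePoints()
  .forEach(cp -> System.out.printf("U+%04X%n", cp));
// prints U+0061, U+1F600, U+0062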

SlimerDude Tue 10 Nov 2015

Cool, that's fine for converting to and from native strings; no need to introduce supplementary character surrogates just yet!

But I think fan.sys.Charset.Utf8Encoder / Utf8Decoder and others should still be updated to handle the full UTF-8 range. That's because they're used by Buf and the streams, which concentrate on byte data. OutStream.writeChar(Int char), for instance, has no such Java limitation but still suffers from a 0xFFFF limit.
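
As a sketch of what the byte-oriented side could look like for the 4-byte form (illustrative only - Utf8Sketch and writeChar4 are made-up names, not the actual Fantom encoder):

import java.io.*;

class Utf8Sketch
{
  // Encode a code point in 0x10000 -> 0x10FFFF as the 4-byte UTF-8 form
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx - no 16-bit char limit involved.
  static void writeChar4(OutputStream out, int cp) throws IOException
  {
    out.write(0xF0 | (cp >> 18));
    out.write(0x80 | ((cp >> 12) & 0x3F));
    out.write(0x80 | ((cp >> 6)  & 0x3F));
    out.write(0x80 | (cp & 0x3F));
  }
}

For example, writeChar4(out, 0x1F600) writes the bytes F0 9F 98 80.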

SlimerDude Thu 1 Mar 2018

Spent the day tracking down a problem with a JSON response from Microsoft's Bing Web Search API, part of their Azure Cognitive Services offering.

The issue was that their JSON was UTF-8 encoded and contained characters from the supplementary 0x10000 -> 0x10FFFF range - emojis from various web pages - which, as mentioned previously in this thread, cannot be represented by a single Java char.

To be a lot more correct, and to help diagnose future problems, Fantom's UTF-8 decoder should really be throwing this:

diff -r c0eeb29f20c6 src/sys/java/fan/sys/Charset.java
--- a/src/sys/java/fan/sys/Charset.java	Wed Feb 28 09:11:29 2018 -0500
+++ b/src/sys/java/fan/sys/Charset.java	Thu Mar 01 18:17:36 2018 +0000
@@ -139,7 +139,7 @@
-          throw IOErr.make("Invalid UTF-8 encoding");
+          throw UnsupportedErr.make("Unsupported UTF-8 encoding");

But I would much rather Fantom followed @KevinKelley's suggestion of returning the Unicode replacement character for all unrepresentable characters, as this would still allow otherwise valid UTF-8 strings to be read and decoded.

diff -r c0eeb29f20c6 src/sys/java/fan/sys/Charset.java
--- a/src/sys/java/fan/sys/Charset.java	Wed Feb 28 09:11:29 2018 -0500
+++ b/src/sys/java/fan/sys/Charset.java	Thu Mar 01 18:50:42 2018 +0000
@@ -119,7 +119,7 @@
     {
       int c = in.r();
       if (c < 0) return -1;
-      int c2, c3;
+      int c2, c3, c4;
       switch (c >> 4)
       {
         case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
@@ -139,7 +139,16 @@
             throw IOErr.make("Invalid UTF-8 encoding");
           return (((c & 0x0F) << 12) | ((c2 & 0x3F) << 6) | ((c3 & 0x3F) << 0));
         default:
-          throw IOErr.make("Invalid UTF-8 encoding");
+          /* 1111 0xxx  10xx xxxx  10xx xxxx  10xx xxxx */
+          c2 = in.r();
+          c3 = in.r();
+          c4 = in.r();
+          if (((c2 & 0xC0) != 0x80) || ((c3 & 0xC0) != 0x80) || ((c4 & 0xC0) != 0x80))
+            throw IOErr.make("Invalid UTF-8 encoding");
+          // Java can't handle chars in this upper / extended range
+          // so return a replacement character instead
+          // see https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character
+          return 0xFFFD;
       }
     }
   }
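
For reference, tracing the bytes of U+1F600 (F0 9F 98 80) through the patched default branch (a standalone sketch, not part of the patch itself):

int c = 0xF0, c2 = 0x9F, c3 = 0x98, c4 = 0x80;
// c >> 4 == 0xF, so we land in the new default branch
assert (c2 & 0xC0) == 0x80;  // all three trailing bytes are valid
assert (c3 & 0xC0) == 0x80;  // continuations, so no IOErr is thrown
assert (c4 & 0xC0) == 0x80;  // and the decoder returns 0xFFFD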

KevinKelley Thu 1 Mar 2018

+1, I still think that's the best idea.

brian Fri 2 Mar 2018

I pushed your suggested change, Steve - at least then you can decode valid UTF-8, even if you can't represent it cleanly in Java (without using surrogates).
