#2532 How to decode char data from OutStream?

SlimerDude Tue 19 Apr 2016

I have a system that writes chars to an OutStream, in effect it is doing this:

out := MyOutStream() 
out.writeChar('\$')
out.writeChar('£')
out.writeChar('€')

My aim is to simply print those chars to the screen.

The issue is that OutStream provides byte data and I want to print character data. In the example above, the 3 chars when UTF-8 encoded produce 6 bytes: 0x24 0xC2 0xA3 0xE2 0x82 0xAC - the pound sign being responsible for 2 bytes and the Euro symbol for the last 3 bytes.
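The byte sequence for these three characters can be checked with a few lines of Java. This is only an illustrative sketch: the class name Utf8Bytes and the helper hex() are made up for the example and are not part of Fantom or of the system described here.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    // Returns the UTF-8 bytes of s as space-separated hex values.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("$"));       // 0x24
        System.out.println(hex("\u00A3"));  // pound sign: 0xC2 0xA3
        System.out.println(hex("\u20AC"));  // Euro sign:  0xE2 0x82 0xAC
    }
}
```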

I thought Fantom Charsets would be the way forward, but all their internals are locked down to native land and not accessible to Fantom code.

My next idea was to use a Buf to perform the decoding:

class MyOutStream : OutStream {
    new make() : super(null) { }
    
    Buf buf := Buf()
    
    override This write(Int byte) {

        // write binary to the Buf
        buf.seek(buf.size).write(byte)

        // read characters from the Buf!
        ch := buf.seek(0).readChar

        // if successful, reset the Buf and print the char
        if (ch != null) {
            buf.clear
            printChar(ch)
        }

        return this
    }
    
    ** The goal!
    private Void printChar(Int char) {
        echo(char.toChar)
    }
}

But readChar() throws sys::IOErr: Invalid UTF-8 encoding when it has an incomplete UTF-8 sequence, e.g. after only the first byte of the Euro symbol.

Any ideas?

brian Tue 19 Apr 2016

You have to do it in Java, by overriding this method:

public OutStream writeChar(char c)

It's not optimal, but all the I/O stuff works at the 32-bit level to provide high performance (vs upcasting everything to long and then back down to int).

SlimerDude Tue 19 Apr 2016

Thanks!

Shame it doesn't really work because the other (external) system is still calling write(Int byte) - and it would be really nice if there was a pure Fantom way of doing it.

I guess what I'm after is something like:

bytes --> BUFFER --> chars

For comparison, Java has the InputStreamReader class.

It seems to be the sort of thing that InStream / Buf should be able to handle, especially as they already have decoding / encoding Charsets.

In fact, the code I proposed does (mostly) work if you wrap the readChar() call in a try / catch.

The current issue with readChar() is that it doesn't distinguish between Invalid Sequence and Not Enough Data. When the first byte of the Euro symbol is read, that byte isn't particularly invalid, it just needs 2 more to create a valid char.
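For comparison, Java's java.nio.charset.CharsetDecoder does make this distinction: when decode() is told that more input may follow (endOfInput = false), an incomplete trailing sequence yields an underflow result with the bytes left unconsumed, while a genuinely malformed sequence yields an error result. A minimal sketch, assuming a hypothetical helper class Utf8Probe (not part of Fantom):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class Utf8Probe {
    // Classifies the given bytes as "complete", "incomplete" or "invalid" UTF-8.
    static String classify(byte[] bytes) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(bytes.length + 1);
        // endOfInput = false: tell the decoder that more bytes may still arrive
        CoderResult r = dec.decode(in, out, false);
        if (r.isError()) return "invalid";
        // Unconsumed bytes after a clean result are the start of a partial char
        return in.hasRemaining() ? "incomplete" : "complete";
    }

    public static void main(String[] args) {
        System.out.println(classify(new byte[] { (byte) 0xE2 }));                            // incomplete
        System.out.println(classify(new byte[] { (byte) 0xE2, (byte) 0x82, (byte) 0xAC }));  // complete
        System.out.println(classify(new byte[] { (byte) 0xE2, 0x24 }));                      // invalid
    }
}
```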

I was thinking that maybe the semantics of readChar() could be updated so that it returns null if EOS or not enough data to read a valid char (as they're kinda similar) and throw IOErr if invalid.

Or..

Maybe something like an extra Int availChars() method to supplement the existing avail() method? That would be ideal because the value could then be used with readChars(Int n). Any partial char data at the end of the Buf would then be left un-read, until more stream data is added.

Only, thinking it through, I don't see how availChars() could work without first decoding the entire Buf! So instead, how about Str readAvailChars() which would read all the valid chars it can, leaving un-read and intact any partial char data?
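To illustrate the semantics being asked for, here is a rough Java sketch of a byte accumulator with a readAvailChars() method. It is not a Fantom API: the class name Utf8Accumulator and its methods are hypothetical, and throwing an unchecked exception on malformed input (after dropping the bad bytes) is just one possible choice.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

public class Utf8Accumulator {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    private ByteBuffer pending = ByteBuffer.allocate(16); // kept in write mode

    // Appends one byte, as a write(Int byte) call from the external system would.
    public void write(int b) {
        if (!pending.hasRemaining()) {
            // Buffer full: grow it, keeping the bytes already written
            ByteBuffer bigger = ByteBuffer.allocate(pending.capacity() * 2);
            pending.flip();
            bigger.put(pending);
            pending = bigger;
        }
        pending.put((byte) b);
    }

    // Decodes every complete char; a trailing partial sequence stays buffered.
    public String readAvailChars() {
        pending.flip(); // switch to read mode
        CharBuffer out = CharBuffer.allocate(pending.remaining() + 1);
        CoderResult r = decoder.decode(pending, out, false); // more input may follow
        if (r.isError()) {
            // Skip the offending bytes so later calls are not stuck on them
            pending.position(pending.position() + r.length());
            pending.compact();
            throw new UncheckedIOException(new MalformedInputException(r.length()));
        }
        pending.compact(); // keep unconsumed bytes, back to write mode
        out.flip();
        return out.toString();
    }
}
```

Writing 0x24 and 0xC2 and then calling readAvailChars() returns "$" and keeps 0xC2 buffered; once 0xA3 arrives, the next call returns the pound sign.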

This may only be a small hole in Fantom's stream functionality, but as everything else is in place, it'd be really nice if it were patched up.

brian Tue 19 Apr 2016

I think a much simpler direction is to just create a special version of OutStream that takes the performance hit for each char write. Something like a native class that does this:

public OutStream writeChar(char c) { return sinkChar(c); }

/* This version can be overridden and handled in pure Fantom */
public OutStream sinkChar(long c) { return this; }
