alternative to Unicode

From: <songdog!roman_at_nospam.org>
Date: Fri Sep 30 1994 - 22:50:40 PDT

I noticed the mention of Unicode and wide character support here a while
back, and I thought people might be interested in an alternative I came
across recently. P.J. Plauger writes a column for "Embedded Systems
Programming." In the May 1994 issue he discusses an encoding for
multibyte characters that was developed at AT&T Bell Labs. It is
called FSS-UTF; he says the "FSS" part stands for "file-system safe" and
the "UTF" might mean something like "universal transfer format."

The "file-system safe" part has to do with the fact that certain
characters are treated specially when they appear in pathnames;
conventional multi-byte encodings could, for example, produce a '/' or a
'\0' as one byte of the two-byte representation of a Kanji character,
with undesirable consequences. FSS-UTF gets around this by
representing the 7-bit ASCII characters as themselves, and never using
these values as part of a multi-byte sequence. All bytes of a
multi-byte sequence have their high bit set.
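
To make the pathname problem concrete, here is a small Python sketch
(not from the article). The value 0x4E2F is picked only because its low
byte is 0x2F, the code for '/'. The plain two-byte form and the helper
name path_components are assumptions made for illustration; the
three-byte form follows the FSS-UTF rules quoted later in this message.

```python
def path_components(raw):
    """Split a raw pathname on the '/' byte, as byte-oriented code would."""
    return raw.split(b"/")


# Hypothetical wide character with the 16-bit value 0x4E2F.
two_byte = bytes([0x4E, 0x2F])        # plain big-endian form; 0x2F is '/'
fss_utf = bytes([0xE4, 0xB8, 0xAF])   # 3-byte FSS-UTF form of the same value

# The two-byte form injects a spurious separator into the pathname.
broken = path_components(b"dir/" + two_byte + b"/file")

# The FSS-UTF form has the high bit set in every byte, so it cannot
# collide with '/' (0x2F) or NUL (0x00).
intact = path_components(b"dir/" + fss_utf + b"/file")

print(broken)   # four components, one of them empty
print(intact)   # three components, the middle one is the character
```

Code that only looks for '/' and '\0' byte by byte can therefore keep
working unchanged on FSS-UTF pathnames.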

Here is the full description of the encoding quoted from the article:

    A byte with a leading 0 bit (hex 00 through hex 7F) stands for a
    wide character with the same value, in the range 00-7F.

    A byte with a leading 10 (80-BF) is never the first byte of a
    multibyte sequence. Such an NFB contributes its six low-order bits
    to a wide character, as specified below.

    A byte with a leading 110 (C0-DF) is the first of a 2-byte sequence.
    It contributes its five low-order bits to a wide character in the
    range 40-7FF, and a following NFB contributes the low-order six
    bits.

    A byte with a leading 1110 (E0-EF) is the first of a 3-byte
    sequence. It contributes its four low-order bits to a wide
    character in the range 1000-FFFF, and the two following NFBs
    contribute the low-order 12 bits, in descending order of
    significance.
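
The rules above can be sketched in Python roughly as follows. This is
an illustration, not code from the article, and the function names are
made up. One assumption to note: the encoder always picks the shortest
form that fits, which puts the boundaries at 0x80 and 0x800. That
follows from the bit counts (7, 11 and 16 bits) but differs from the
lower bounds 40 and 1000 given in the quoted text.

```python
def fss_utf_encode(wc):
    """Encode one wide character (0..0xFFFF) as 1 to 3 bytes."""
    if wc < 0 or wc > 0xFFFF:
        raise ValueError("only the 1- to 3-byte forms are sketched here")
    if wc < 0x80:                       # 0xxxxxxx
        return bytes([wc])
    if wc < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (wc >> 6), 0x80 | (wc & 0x3F)])
    return bytes([0xE0 | (wc >> 12),    # 1110xxxx 10xxxxxx 10xxxxxx
                  0x80 | ((wc >> 6) & 0x3F),
                  0x80 | (wc & 0x3F)])


def fss_utf_decode(data):
    """Decode a byte string into a list of wide-character values."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            wc, extra = b, 0
        elif 0xC0 <= b <= 0xDF:
            wc, extra = b & 0x1F, 1
        elif 0xE0 <= b <= 0xEF:
            wc, extra = b & 0x0F, 2
        else:
            # 80-BF is an NFB; F0 and above belong to longer forms.
            raise ValueError("byte %#x cannot start a sequence here" % b)
        if i + extra >= len(data):
            raise ValueError("truncated sequence")
        for nfb in data[i + 1:i + 1 + extra]:
            if nfb & 0xC0 != 0x80:
                raise ValueError("expected a non-first byte (80-BF)")
            wc = (wc << 6) | (nfb & 0x3F)
        out.append(wc)
        i += 1 + extra
    return out


encoded = b"".join(fss_utf_encode(wc) for wc in (0x41, 0x80, 0x4E2F))
print(encoded.hex())
print([hex(wc) for wc in fss_utf_decode(encoded)])
```

Because an NFB can never be mistaken for a first byte, a decoder that
lands in the middle of a sequence can skip forward to the next byte
outside 80-BF and pick up again from there.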

Plauger goes on to describe obvious extensions which can encode the ISO
10646 31-bit (!) character set within six bytes. He also points out
that FSS-UTF could be used, for example, to write integer values into a
byte stream in a system-independent fashion, while also providing some
amount of compression for smaller values.
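
Here is a rough Python sketch of the integer idea. The extension is
not spelled out in this message, so the code assumes it simply
continues the pattern: lead bytes 11110xxx, 111110xx and 1111110x
introduce 4-, 5- and 6-byte sequences carrying 21, 26 and 31 bits. The
names encode_uint and decode_uint are invented for the example.

```python
# Assumed extension of the pattern up to six bytes (31 bits).
_LIMITS = [0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]
_LEADS = [0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC]


def encode_uint(n):
    """Encode 0 <= n < 2**31 in 1 to 6 bytes, shortest form first."""
    if not 0 <= n < 0x80000000:
        raise ValueError("value does not fit in 31 bits")
    length = next(k for k, limit in enumerate(_LIMITS, start=1) if n < limit)
    if length == 1:
        return bytes([n])
    tail = []
    for _ in range(length - 1):         # peel off 6 bits per NFB
        tail.append(0x80 | (n & 0x3F))
        n >>= 6
    return bytes([_LEADS[length - 1] | n] + tail[::-1])


def decode_uint(data):
    """Decode one value from the front of data; return (value, length)."""
    first = data[0]
    if first < 0x80:
        return first, 1
    if first < 0xC0:
        raise ValueError("a non-first byte cannot start a sequence")
    length, mask = 2, 0x20
    while first & mask:                 # count the extra leading 1 bits
        length += 1
        mask >>= 1
    if length > 6:
        raise ValueError("lead byte FE/FF is not used")
    if len(data) < length:
        raise ValueError("truncated sequence")
    value = first & (mask - 1)          # payload bits of the lead byte
    for nfb in data[1:length]:
        if nfb & 0xC0 != 0x80:
            raise ValueError("expected a non-first byte (80-BF)")
        value = (value << 6) | (nfb & 0x3F)
    return value, length


for n in (5, 300, 70000, 0x7FFFFFFF):
    print(n, encode_uint(n).hex())
```

The result is the same on any machine regardless of native byte order
or word size, and values below 128 cost a single byte.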

-- 
Bill Roman  (songdog!roman@eskimo.com / roman@songdog.uucp)   running linux
Received on Sat Oct 1 15:50:44 1994
