[aprssig] Please, standardize UTF-8 for APRS
Heikki Hannikainen
hessu at hes.iki.fi
Mon Sep 21 16:20:36 EDT 2009
On Mon, 21 Sep 2009, Stephen H. Smith wrote:
> Joel Maslak - N7XUC wrote:
>> I am all for this. One caution - we need to define max message length in
>> BYTES not characters!
>>
>> On Sep 21, 2009, at 8:24 AM, Heikki Hannikainen <hessu at hes.iki.fi> wrote:
>>
>>> UTF-8 would be very, very good for everyone involved (including Sergej),
>>> even if your particular character set happens to fit within 8 bits.
>
> Is this going to be practical AT ALL within the existing APRS framework?
Yes. Non-English text does take more than one byte per character for its
non-ASCII letters, but if the language uses mostly ASCII characters (many
languages only add a small set of extra letters), the ASCII characters are
still the same, one byte per character.
> Currently the APRS message string length is limited to 60-70 bytes (i.e.
> classic APRS ASCII characters). If one starts using UTF-8, you may be
> using 3-4-5 bytes per character, reducing the effective length of a message
> to only 15-20 non-western characters. Is this going to be enough to be useful
> for anything?
With UTF-8, 1 to 3 bytes per character in practice. There's a hard limit
of 4 bytes per char in UTF-8, but practically everything fits in 3.
Japanese and Chinese (I suppose) characters take 3 bytes for most or all
characters, but it isn't so much of a problem, since all common words have
their own character. Think about it: A single character represents a word.
"A guideline created by the Japanese Ministry of Education, the list of
kyōiku kanji ("education kanji", a subset of jōyō kanji), specifies the
1,006 simple characters a child is to learn by the end of sixth grade.
Children continue to study another 939 characters in junior high school,
covering in total 1,945 jōyō kanji." (Wikipedia: Japanese language)
In Finnish, we only have a couple of extra letters in use (åäö ÅÄÖ), and
those characters take 2 bytes in UTF-8. The rest of our letters are ASCII,
and take 1 byte per character.
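
The byte counts above can be checked with a short Python sketch. The sample strings are illustrative assumptions of mine, not APRS test vectors, and `utf8_cost` is a hypothetical helper name.

```python
# Sketch: compare character count and UTF-8 byte count for sample strings.
# The samples are illustrative only; they do not come from the APRS spec.

def utf8_cost(text):
    """Return (character count, UTF-8 byte count) for text."""
    return len(text), len(text.encode("utf-8"))

samples = {
    "English": "hello",        # pure ASCII: 1 byte per character
    "Finnish": "hyvää yötä",   # ASCII plus a few 2-byte letters
    "Japanese": "日本語",       # 3 bytes per character
}

for name, text in samples.items():
    chars, nbytes = utf8_cost(text)
    print(f"{name}: {chars} characters, {nbytes} bytes")
```

For the ASCII sample the two numbers are equal; for the Japanese sample the byte count is three times the character count.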
I thought the 67-byte limit was only present in old hardware which does not
support UTF-8 or non-ASCII characters anyway, and that they could be
omitted from this discussion. Once a hardware manufacturer puts in enough
memory to store the additional fonts for the new characters (2 megs of
flash was Matti's guess), they can easily put in a few bytes of memory to
support longer text messages.
Now that I re-read the APRS specification, it says:
"The message text may be up to 67 characters long, and may contain any
printable ASCII characters except |, ~ or {."
Please note that it says *characters*, not *bytes*. :)
But OK, I suppose the authors meant bytes when they wrote characters,
since they limited it to ASCII.
Once the definition is changed to support UTF-8, it can be changed to
support longer messages.
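
If the limit is read as bytes, a sender has to count encoded bytes and must not cut a multi-byte character in half. A minimal sketch, assuming a 67-byte limit (the spec says 67 characters; treating that as bytes is an assumption here) and hypothetical helper names `fits` and `truncate_utf8`:

```python
# Sketch: enforce a message limit counted in bytes rather than characters.
# MAX_BYTES = 67 is an assumption based on the "67 characters" in the spec.

MAX_BYTES = 67

def fits(text, limit=MAX_BYTES):
    """True if the UTF-8 encoding of text is at most limit bytes."""
    return len(text.encode("utf-8")) <= limit

def truncate_utf8(text, limit=MAX_BYTES):
    """Shorten text to at most limit UTF-8 bytes without splitting a character."""
    data = text.encode("utf-8")[:limit]
    # The slice may end in the middle of a multi-byte sequence;
    # errors="ignore" drops that incomplete tail.
    return data.decode("utf-8", errors="ignore")

print(fits("a" * 67))                  # ASCII: 67 characters fit
print(len(truncate_utf8("日" * 30)))   # 3-byte characters: fewer fit
```

With 3-byte characters, 67 bytes hold 22 whole characters, which is in the same range as the 15-20 figure quoted above.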
> Consider especially an APRS-to-email message. By the
> time you express the "name at server.domain" email address, there will
> be practically NO space for the message.
Wrong. With UTF-8, the first 128 characters are equal to ASCII. Those
characters in the email address are still 8 bits, 1 byte, per character.
name at server.domain takes just as many bytes in UTF-8 as in ASCII, because
they're transmitted exactly the same way. 'name' is equal to 'name', and @
still takes 1 byte, because it's encoded the same in both ASCII and UTF-8:
64 in decimal, 0x40 in hex, 01000000 in bits.
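
This is easy to verify. The sketch below uses the placeholder address from the quoted message, written with a real @ sign, and encodes it both ways:

```python
# Sketch: an ASCII-only string encodes to identical bytes in ASCII and UTF-8.
# "name@server.domain" is the placeholder address from the quoted message.

address = "name@server.domain"

ascii_bytes = address.encode("ascii")
utf8_bytes = address.encode("utf-8")

print(ascii_bytes == utf8_bytes)          # same bytes on the wire
print(len(address), len(utf8_bytes))      # one byte per character

at = "@".encode("utf-8")
print(len(at), at[0], hex(at[0]), format(at[0], "08b"))
```

The last line prints the single byte for @ in decimal, hex and binary.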
When you start to support additional characters, you'll inevitably need
more bits per character. I'm going to quote the Wikipedia article a bit:
"The UTF-8 encoding is variable-width, ranging from 1-4 bytes. Each byte
has 0-4 leading 1 bits followed by a zero bit to indicate its type. N 1
bits indicates the first byte in a N-byte sequence, with the exception
that zero 1 bits indicates a one-byte sequence while one 1 bit indicates a
continuation byte in a multi-byte sequence (this was done for ASCII
compatibility).
[a nice table omitted]
So the first 128 characters (US-ASCII) need one byte. The next 1,920
characters need two bytes to encode. This includes Latin letters with
diacritics and characters from Greek, Cyrillic, Coptic, Armenian, Hebrew,
Arabic, Syriac and Tāna alphabets. Three bytes are needed for the rest of
the Basic Multilingual Plane (which contains virtually all characters in
common use). Four bytes are needed for characters in the other planes of
Unicode, which include less common CJK characters and various historic
scripts."
http://en.wikipedia.org/wiki/UTF-8#Description
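
The leading-bit rule in the quote can be sketched in a few lines of Python. `utf8_byte_kind` is a hypothetical helper; it only counts leading 1 bits as described above and does not perform full UTF-8 validation (for example, it does not reject overlong encodings).

```python
# Sketch: classify a byte by its number of leading 1 bits, as in the quote.
# This is not a full validator; it only looks at the bit pattern of one byte.

def utf8_byte_kind(byte):
    """Classify one byte value (0-255) of a UTF-8 stream."""
    ones = 0
    mask = 0x80
    while mask and byte & mask:   # count leading 1 bits
        ones += 1
        mask >>= 1
    if ones == 0:
        return "single"           # 0xxxxxxx: one-byte (ASCII) sequence
    if ones == 1:
        return "continuation"     # 10xxxxxx: inside a multi-byte sequence
    if ones <= 4:
        return f"lead-{ones}"     # first byte of an N-byte sequence
    return "invalid"              # more than four leading 1 bits

for ch in ("A", "ä", "日"):
    print(ch, [utf8_byte_kind(b) for b in ch.encode("utf-8")])
```

An ASCII letter gives one "single" byte, while each non-ASCII character gives one lead byte followed by continuation bytes.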
> Not to mention the difficulty of entering these on DTMF pads on Kenwood
> radios. (Is it even possible ?). [Not that I really care. I think
> entering even US ASCII text on a cellphone numeric keyboard is a perverse
> masochistic exercise. I refuse to do any kind of text comms on less than a
> full computer keyboard!]
> It would appear that until the Kenwood radios fade from the APRS scene that
> it will be really difficult to incorporate UTF-anything into the present
> patchwork "kludge" of improvised workarounds and expedients for the
> limitations of old packet hardware.
You're right - you won't be able to send or receive Japanese or
Scandinavian characters using the old Kenwoods. It'll never be possible
since their firmware won't be fixed to support this. And they probably
wouldn't have enough memory for the fonts. [Not that I care at all.]
But is that a reason to *not* add support for additional languages in the
protocol, so that any *new* and *future* software releases could support
it? Especially when the change is backwards compatible, so that the old
and new devices can still talk to each other, without any configuration
changes, when English text is transmitted?
> Any extension that could incorporate UTF-8, longer text strings, more
> robust encoding and error correction, etc will be so utterly different
> from the present APRS protocols that I don't think it should be called
> "APRS" at all.
I do not think so. Adding support for UTF-8 isn't actually very difficult,
because UTF-8 is designed to be backwards compatible with ASCII. In this
thread, I'm only arguing about UTF-8, not about any major changes like
robust encoding or error correction.
Hey, I'm sending this very message from a UTF-8 enabled system, and you
can still read it! It really is backwards compatible! 日本語 (Japanese)
may not work for you, and I don't understand it either, so we don't care
about that part. But while I'm sending in UTF-8 and you're reading in
ASCII, we can still understand each other because your ASCII-only
system can show my English UTF-8 text.
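
The same effect can be simulated in Python. This is a sketch of an ASCII-only reader, not of any real APRS client; `readable_as_ascii` is a hypothetical helper that replaces every byte it cannot interpret.

```python
# Sketch: what an ASCII-only receiver makes of UTF-8 encoded text.
# errors="replace" stands in for a display that shows a placeholder
# for every byte it does not understand.

def readable_as_ascii(text):
    """Encode text as UTF-8, then decode it the way an ASCII-only system would."""
    data = text.encode("utf-8")
    return data.decode("ascii", errors="replace")

print(readable_as_ascii("English text survives unchanged"))
print(readable_as_ascii("日本語 (Japanese)"))
```

English text comes back unchanged; each byte of the Japanese characters turns into a replacement character, while the ASCII part of that line stays readable.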
- Hessu