[aprssig] Please, standardize UTF-8 for APRS

Mon Sep 21 16:20:36 EDT 2009

On Mon, 21 Sep 2009, Stephen H. Smith wrote:

> Joel Maslak - N7XUC wrote:
>> I am all for this.  One caution - we need to define max message length in 
>> BYTES not characters!
>> 
>> On Sep 21, 2009, at 8:24 AM, Heikki Hannikainen <hessu at hes.iki.fi> wrote:
>> 
>>> UTF-8 would be very, very good for everyone involved (including Sergej), 
>>> even if your particular character set happens to fit within 8 bits.
>
> Is this going to be practical AT ALL  within the existing APRS framework?

Yes. It does take more than one byte per character if you send 
non-English text, but if that language uses mostly ASCII characters (many 
languages only use a small set of extra letters), the ASCII characters are 
still the same, one byte per character.

> Currently the APRS message string length is limited to 60-70 bytes (i.e. 
> classic APRS ASCII characters).   If one starts using UTF-8, you may be 
> using 3-4-5 bytes per character, reducing the effective length of a message 
> to only 15-20 non-western characters.  Is this going to be enough to be 
> useful for anything?

With UTF-8, 1 to 3 bytes per character in practice. There's a hard limit 
of 4 bytes per char in UTF-8, but practically everything fits in 3. 
Japanese and Chinese (I suppose) characters take 3 bytes for most or all 
characters, but it isn't so much of a problem, since all common words have 
their own character. Think about it: A single character represents a word.

"A guideline created by the Japanese Ministry of Education, the list of 
kyōiku kanji ("education kanji", a subset of jōyō kanji), specifies the 
1,006 simple characters a child is to learn by the end of sixth grade. 
Children continue to study another 939 characters in junior high school, 
covering in total 1,945 jōyō kanji." (Wikipedia: Japanese language)

In Finnish, we only have a couple of extra letters in use (åäö ÅÄÖ), and 
those characters take 2 bytes in UTF-8. The rest of our letters are ASCII, 
and take 1 byte per character.
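
As a rough illustration of the byte counts above, here is a small Python 
sketch. The helper name utf8_len and the sample strings are made up for 
this example; they are not part of any APRS software.

```python
# Compare the number of characters with the number of UTF-8 bytes.
# utf8_len is a hypothetical helper name used only for this illustration.
def utf8_len(text: str) -> int:
    """Return how many bytes `text` occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))

# ASCII letter, Finnish letters, one kanji, and a mostly-ASCII Finnish phrase.
samples = ["a", "ä", "Å", "日", "Hyvää päivää"]

for s in samples:
    print(f"{s!r}: {len(s)} characters, {utf8_len(s)} bytes")
```

The Finnish phrase has 12 characters and needs 17 bytes: only the five 
"ä" letters cost an extra byte each.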

I thought the 67-byte limit was only present in old hardware which does not 
support UTF-8 or non-ASCII characters anyway, and that they could be 
omitted from this discussion. Once a hardware manufacturer puts in enough 
memory to store the additional fonts for the new characters (2 megs of 
flash was Matti's guess), they can easily put in a few bytes of memory to 
support longer text messages.

Now that I re-read the APRS specification, it says:

"The message text may be up to 67 characters long, and may contain any 
printable ASCII characters except |, ~ or {."

Please note that it says *characters*, not *bytes*. :)

But OK, I suppose the authors meant bytes when they wrote characters, 
since they limited it to ASCII.

Once the definition is changed to support UTF-8, it can be changed to 
support longer messages.
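
To make the characters-versus-bytes distinction concrete, here is a 
hedged Python sketch. It assumes the 67 in the specification is read as a 
byte limit; the names MAX_MESSAGE_BYTES, fits and truncate_utf8 are 
invented for this example and do not come from the APRS specification.

```python
# Assumption: the spec's "67 characters" is treated as 67 bytes on the wire.
MAX_MESSAGE_BYTES = 67

def fits(text: str, limit: int = MAX_MESSAGE_BYTES) -> bool:
    """True if the UTF-8 encoding of `text` is at most `limit` bytes."""
    return len(text.encode("utf-8")) <= limit

def truncate_utf8(text: str, limit: int = MAX_MESSAGE_BYTES) -> str:
    """Cut `text` to at most `limit` UTF-8 bytes without splitting a character."""
    data = text.encode("utf-8")[:limit]
    # The byte slice may end in the middle of a multi-byte sequence;
    # errors="ignore" drops that incomplete tail instead of raising.
    return data.decode("utf-8", errors="ignore")

print(fits("a" * 67))                  # 67 ASCII letters are 67 bytes
print(fits("ä" * 34))                  # 34 letters, but 68 bytes
print(len(truncate_utf8("ä" * 40)))    # number of letters kept
```

A sender that counts bytes this way never transmits half a character, 
whatever the final limit turns out to be.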

>      Consider especially an APRS-to-email message.   By the 
> time you express the "name at server.domain" email address, there will 
> be practically NO space for the message.

Wrong. With UTF-8, the first 128 characters are equal to ASCII. Those 
characters in the email address are still 8 bits, 1 byte, per character. 
name at server.domain takes just as many bytes in UTF-8 as in ASCII, because 
they're transmitted exactly the same way. 'name' is equal to 'name', and @ 
still takes 1 byte, because it's encoded the same in both ASCII and UTF-8: 
64 in decimal, 0x40 in hex, 01000000 in bits.
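
This is easy to check. The following Python sketch uses a placeholder 
address (not a real mailbox) and compares its ASCII and UTF-8 encodings:

```python
# Placeholder address for illustration only.
address = "name@server.domain"

ascii_bytes = address.encode("ascii")
utf8_bytes = address.encode("utf-8")

# For pure ASCII text the two encodings are byte-for-byte identical.
print(ascii_bytes == utf8_bytes)
# One byte per character, so byte count equals character count.
print(len(utf8_bytes), len(address))
# The '@' sign in decimal, hex and binary.
at = ord("@")
print(at, hex(at), format(at, "08b"))
```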

When you start to support additional characters, you'll inevitably need 
more bits per character. I'm going to quote the Wikipedia article a bit:

"The UTF-8 encoding is variable-width, ranging from 1-4 bytes. Each byte 
has 0-4 leading 1 bits followed by a zero bit to indicate its type. N 1 
bits indicates the first byte in a N-byte sequence, with the exception 
that zero 1 bits indicates a one-byte sequence while one 1 bit indicates a 
continuation byte in a multi-byte sequence (this was done for ASCII 
compatibility).

[a nice table omitted]

So the first 128 characters (US-ASCII) need one byte. The next 1,920 
characters need two bytes to encode. This includes Latin letters with 
diacritics and characters from Greek, Cyrillic, Coptic, Armenian, Hebrew, 
Arabic, Syriac and Tāna alphabets. Three bytes are needed for the rest of 
the Basic Multilingual Plane (which contains virtually all characters in 
common use). Four bytes are needed for characters in the other planes of 
Unicode, which include less common CJK characters and various historic 
scripts."

http://en.wikipedia.org/wiki/UTF-8#Description
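
The leading-bit rule and the size ranges quoted above can be sketched in 
Python as follows. The function names byte_kind and encoded_length are 
made up for this example, and the sketch only classifies bytes; it is not 
a full UTF-8 validator.

```python
def byte_kind(b: int) -> str:
    """Classify one byte by its leading 1 bits, as described in the quote."""
    if (b & 0b10000000) == 0b00000000:
        return "single"        # 0xxxxxxx: plain ASCII
    if (b & 0b11000000) == 0b10000000:
        return "continuation"  # 10xxxxxx: inside a multi-byte sequence
    if (b & 0b11100000) == 0b11000000:
        return "lead-2"        # 110xxxxx: first of 2 bytes
    if (b & 0b11110000) == 0b11100000:
        return "lead-3"        # 1110xxxx: first of 3 bytes
    if (b & 0b11111000) == 0b11110000:
        return "lead-4"        # 11110xxx: first of 4 bytes
    return "invalid"

def encoded_length(code_point: int) -> int:
    """Number of UTF-8 bytes needed for a Unicode code point."""
    if code_point < 0x80:
        return 1               # US-ASCII, 128 characters
    if code_point < 0x800:
        return 2               # the next 0x800 - 0x80 = 1,920 characters
    if code_point < 0x10000:
        return 3               # rest of the Basic Multilingual Plane
    return 4                   # other planes

for ch in ["@", "ä", "日"]:
    kinds = [byte_kind(b) for b in ch.encode("utf-8")]
    print(ch, encoded_length(ord(ch)), kinds)
```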

> Not to mention the difficulty of entering these on DTMF pads on Kenwood 
> radios.  (Is it even possible ?).  [Not that I really care.  I think 
> entering even US ASCII text on a cellphone numeric keyboard is a perverse 
> masochistic exercise. I refuse to do any kind of text comms on less than a 
> full computer keyboard!] 
> It would appear that until the Kenwood radios fade from the APRS scene that 
> it will be really difficult to incorporate UTF-anything into the present 
> patchwork "kludge" of improvised workarounds and expedients for the 
> limitations of old packet hardware.

You're right - you won't be able to send or receive Japanese or 
Scandinavian characters using the old Kenwoods. It'll never be possible 
since their firmware won't be fixed to support this. And they probably 
wouldn't have enough memory for the fonts. [Not that I care at all.]

But is that a reason to *not* add support for additional languages in the 
protocol, so that any *new* and *future* software releases could support 
it? Especially when the change is backwards compatible, so that the old 
and new devices can still talk to each other, without any configuration 
changes, when English text is transmitted?

> Any extension that could incorporate UTF-8, longer text strings, more 
> robust encoding and error correction, etc will be so utterly different 
> from the present APRS protocols that I don't think it should be called 
> "APRS" at all.

I do not think so. Adding support for UTF-8 isn't actually very difficult, 
because UTF-8 is designed to be backwards compatible with ASCII. In this 
thread, I'm only arguing about UTF-8, not about any major changes like 
robust encoding or error correction.

Hey, I'm sending this very message from a UTF-8 enabled system, and you 
can still read it! It really is backwards compatible! 日本語 (Japanese) 
may not work for you, and I don't understand it either, so we don't care 
about that part. But while I'm sending in UTF-8 and you're reading in 
ASCII, we can still understand each other because your ASCII-only
system can show my English UTF-8 text.
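
A small Python sketch of that situation: the sender encodes in UTF-8 and a 
hypothetical ASCII-only receiver decodes what it can. How a real old 
device displays unknown bytes will vary; replacing them with a marker is 
just one assumed behaviour, and the sample message is invented.

```python
# Sender side: mixed English and Japanese text, encoded as UTF-8.
message = "Hello from Finland, 日本語"
wire = message.encode("utf-8")

# Receiver side: an ASCII-only system. Bytes above 0x7F are not ASCII,
# so this assumed receiver substitutes a replacement marker for each one.
shown = wire.decode("ascii", errors="replace")

english_part = "Hello from Finland, "
print(shown.startswith(english_part))   # the English text survives intact
print(shown.count("\ufffd"))            # one marker per non-ASCII byte
```

The three kanji occupy 9 bytes, so the ASCII-only side shows 9 markers 
after an otherwise unchanged English sentence.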

   - Hessu