[aprssig] Please, standardize UTF-8 for APRS (was: Future Concept for APRS)

Mon Sep 21 10:24:18 EDT 2009

On Mon, 21 Sep 2009, Sergej wrote:

> Will be enough 8bit chars set to include regional subsets.
> But for Chinese or Japan hams possible UTF-x encoding is better?

UTF-8 would be very, very good for everyone involved (including 
Sergej), even if your particular character set happens to fit within 8 
bits.

I'd like to make it work in a single application (aprs.fi) for everyone 
without trying to guess whether the message is in Russian, Finnish or 
Japanese. If UTF-8 were used (as is typically done in email and on the 
web today), there'd be no need to select or guess which character set is 
used in each message.
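
To illustrate the guessing problem, here is a minimal Python sketch. It is not aprs.fi code, and the sample text and the choice of KOI8-R, Windows-1251 and Latin-1 are only assumptions for the demonstration. The same bytes decode without any error under several 8-bit character sets, so the receiver cannot tell which one the sender meant, while UTF-8 on both ends needs no guess:

```python
# Illustrative only: the sample string and the codec choices are assumptions.
text = "\u0440\u0443\u0441\u0441\u043a\u0438\u0439 \u044f\u0437\u044b\u043a"  # "Russian language" in Russian

# The sender uses a legacy 8-bit Cyrillic character set.
raw_8bit = text.encode("koi8_r")

# A receiver guessing the wrong 8-bit set gets no error, just wrong text.
guess_cp1251 = raw_8bit.decode("cp1251")
guess_latin1 = raw_8bit.decode("latin_1")

# With UTF-8 on both ends there is nothing to guess.
raw_utf8 = text.encode("utf-8")
roundtrip = raw_utf8.decode("utf-8")

print(guess_cp1251)
print(guess_latin1)
print(roundtrip)
```

Both wrong guesses produce readable-looking but incorrect characters, which is why a program cannot reliably detect the mistake on its own.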

This message was sent in the UTF-8 encoding of Unicode. If your email 
software supports UTF-8 (probably true for almost all of you by now) and 
you have the relevant fonts installed (this is what usually fails for 
Windows users), you should see these strings correctly:

Russian: русский язык
Traditional Chinese: 漢語
Japanese: 日本語
Finnish: Ääliöt ja pölvästit
French: Français

For those of you who do not have Unicode support, or have not installed 
the fonts containing symbols for all of those funny characters, you'll 
probably see little rectangles, gibberish, or strings like 
"=C3=84=C3=A4=C3=B6=C3=B6=C3=B6" instead of the correct glyphs. But, 
because UTF-8 is backwards compatible with ASCII, at least you'll see this 
English text correctly! ASCII characters have the same single-byte values 
in ASCII and UTF-8.
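
The backwards compatibility can be checked with a few lines of Python. This is only a small sketch with made-up sample strings: it compares the ASCII and UTF-8 bytes of an English string, and prints the UTF-8 bytes of a few Finnish letters in the "=XX" hex style quoted above:

```python
# Illustrative only: the sample strings are assumptions.
ascii_text = "Hello APRS"
finnish = "\u00c4\u00e4\u00f6\u00f6\u00f6"  # A-umlaut, a-umlaut, three o-umlauts

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Non-ASCII characters become multi-byte sequences in which every byte has
# the high bit set, so they never collide with ASCII byte values.
utf8_bytes = finnish.encode("utf-8")
hex_escaped = "".join(f"={b:02X}" for b in utf8_bytes)

print(same_bytes)    # True
print(hex_escaped)   # =C3=84=C3=A4=C3=B6=C3=B6=C3=B6
```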

If some other Unicode encoding, like UTF-16, were used, the English 
parts would look really funny, or would not be visible at all. Every other 
byte would be a binary zero (NUL), alternating with the bytes of the ASCII 
characters. It's also a waste of valuable bandwidth when sending ASCII 
text, since every character takes two bytes instead of one! Here's a 
real-world UTF-16 example (the control codes are shown in hex in the raw 
packets display):

http://aprs.fi/?c=raw&call=JA6VRP-2
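
For comparison, here is a small Python sketch of the same effect. The sample text is made up, and little-endian UTF-16 without a byte order mark is an assumption (the station above may use a different byte order):

```python
# Illustrative only: the sample text and the byte order are assumptions.
text = "APRS"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # little-endian, no byte order mark

print(utf8)                   # b'APRS'
print(utf16)                  # b'A\x00P\x00R\x00S\x00'
print(len(utf8), len(utf16))  # 4 8
```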

There's a good chance that a lot of old software written in C will cut 
those messages at the first NUL byte, since the C string handling 
functions use it as the "end-of-string" marker.
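
The truncation can be reproduced from Python with the standard ctypes module, which exposes C-style NUL-terminated buffers. This is a sketch under the same assumptions as before (made-up text, little-endian UTF-16), not a test of any particular APRS program:

```python
import ctypes

# Illustrative only: the sample text and the byte order are assumptions.
utf16 = "APRS".encode("utf-16-le")
utf8 = "APRS".encode("utf-8")

# create_string_buffer() copies the bytes into a C char array and appends NUL.
# .value reads it the way C string functions do: up to the first NUL byte.
seen_by_c_utf16 = ctypes.create_string_buffer(utf16).value
seen_by_c_utf8 = ctypes.create_string_buffer(utf8).value

print(seen_by_c_utf16)  # b'A'
print(seen_by_c_utf8)   # b'APRS'
```

UTF-8 never produces a zero byte except for the NUL character itself, so ordinary text passes through C string functions intact.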

If UTF-8 were used on APRS, users would not need to switch between 
ASCII (backwards compatible for messaging with English-speaking friends) 
and some other character set (for the non-English messages). It would 
"just work". It works for the Internet, and it'd work for us.

Links:

http://en.wikipedia.org/wiki/UTF-8
http://en.wikipedia.org/wiki/UTF-16

   - Hessu