[aprssig] Please, standardize UTF-8 for APRS (was: Future Concept for APRS)
Heikki Hannikainen
hessu at hes.iki.fi
Mon Sep 21 10:24:18 EDT 2009
On Mon, 21 Sep 2009, Sergej wrote:
> Will be enough 8bit chars set to include regional subsets.
> But for Chinese or Japan hams possible UTF-x encoding is better?
UTF-8 would be very, very good for everyone involved (including
Sergej), even if your particular character set happens to fit within 8
bits.
I'd like to make it work in a single application (aprs.fi) for everyone
without trying to guess whether the message is in Russian, Finnish or
Japanese. If UTF-8 were used (as is typical in email and on the web
today), there'd be no need to select or guess which character set is used
in each message.
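To illustrate the guessing problem, here is a minimal Python sketch. The sample word and the choice of KOI8-R and Latin-1 as the competing 8-bit character sets are my own assumptions for illustration, not something taken from real APRS traffic:

```python
# The same bytes mean different things in different 8-bit character sets,
# so the receiver has to guess which one the sender used.
word = "русский"  # hypothetical sample text

raw = word.encode("koi8_r")         # sender uses an 8-bit Russian charset
as_koi8 = raw.decode("koi8_r")      # correct guess: the original word
as_latin1 = raw.decode("latin_1")   # wrong guess: accented Latin gibberish

# With UTF-8 there is only one decoding, so there is nothing to guess.
utf8 = word.encode("utf-8")
as_utf8 = utf8.decode("utf-8")

print(as_koi8)
print(as_latin1)
print(as_utf8)
```

The price is size: this Cyrillic word takes 7 bytes in KOI8-R and 14 bytes in UTF-8, while ASCII text stays the same size.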
This message was sent in the UTF-8 encoding of Unicode. Those of you
who have UTF-8 support in your email software (probably almost all of you
by now) and the relevant fonts installed (this is what usually fails for
Windows users) should see these strings correctly:
Russian: русский язык
Traditional Chinese: 漢語
Japanese: 日本語
Finnish: Ääliöt ja pölvästit
French: Français
For those of you who do not have Unicode support, or have not installed
the fonts containing symbols for all of those funny characters, you'll
probably see little rectangles, gibberish, or strings like
"=C3=84=C3=A4=C3=B6=C3=B6=C3=B6" instead of the correct glyphs. But,
because UTF-8 is backwards compatible with ASCII, at least you'll see this
English text correctly! ASCII characters have the same single-byte values
in ASCII and UTF-8.
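The "=C3=84..." string above is the quoted-printable form of UTF-8 bytes, as used in email. A small Python sketch, using only the standard library, of how it decodes and of the ASCII compatibility (the variable names and the English sample are mine):

```python
import quopri

# Quoted-printable text as a non-UTF-8 mail reader might display it.
qp = b"=C3=84=C3=A4=C3=B6=C3=B6=C3=B6"

raw = quopri.decodestring(qp)   # the underlying UTF-8 bytes
text = raw.decode("utf-8")      # two bytes per character here

# ASCII text encodes to exactly the same bytes in ASCII and in UTF-8.
english = "English text"
same_bytes = english.encode("ascii") == english.encode("utf-8")

print(text, same_bytes)
```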
If some other Unicode encoding, like UTF-16, were used, the English
parts would look really funny, or would not be visible at all. Every other
byte would be a binary zero (NUL), alternating with the bytes of the ASCII
characters. It's also a waste of valuable bandwidth when sending ASCII
text, since every character takes two bytes instead of one! Here's a
real-world UTF-16 example (the control codes are shown in hex in the raw
packets display):
http://aprs.fi/?c=raw&call=JA6VRP-2
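For comparison, a short Python sketch of what the same ASCII text looks like on the wire in both encodings. The sample text and the big-endian byte order are assumptions for illustration; I'm not claiming this is the byte order used in the packets linked above:

```python
def hexdump(data: bytes) -> str:
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)

text = "Hello"  # hypothetical ASCII-only message

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")  # big-endian, no byte order mark

print(hexdump(utf8))    # one byte per ASCII character
print(hexdump(utf16))   # a zero byte paired with every ASCII character
print(len(utf8), len(utf16))
```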
There's a good chance that a lot of old software written in C will cut
those messages at the first NUL byte, since the C string handling
functions use it as the "end-of-string" marker.
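A rough Python simulation of that truncation. The helper c_string_view is hypothetical and only mimics how functions like strlen() or strcpy() stop at the first zero byte; it is not code from any actual APRS software:

```python
def c_string_view(buf: bytes) -> bytes:
    """Return the part of buf that C string functions would treat as the
    string: everything before the first NUL byte."""
    end = buf.find(b"\x00")
    return buf if end == -1 else buf[:end]

msg = "Hello"  # hypothetical message text

# UTF-16 little-endian: 48 00 65 00 ... so only the first letter survives.
seen_le = c_string_view(msg.encode("utf-16-le"))

# UTF-16 big-endian: 00 48 00 65 ... so the string looks empty.
seen_be = c_string_view(msg.encode("utf-16-be"))

# UTF-8 never produces a zero byte for characters other than U+0000,
# so non-ASCII text passes through untouched.
finnish = "Ääliöt".encode("utf-8")
seen_utf8 = c_string_view(finnish)

print(seen_le, seen_be, seen_utf8 == finnish)
```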
If UTF-8 were used on APRS, users would not need to switch between
ASCII (backwards compatible for messaging with English-speaking friends)
and some other character set (for the non-English messages). It would
"just work". It works for the Internet, and it'd work for us.
Links:
http://en.wikipedia.org/wiki/UTF-8
http://en.wikipedia.org/wiki/UTF-16
- Hessu