[aprssig] Please, standardize UTF-8 for APRS
Stephen H. Smith
wa8lmf2 at aol.com
Mon Sep 21 17:41:15 EDT 2009
Heikki Hannikainen wrote:
> On Mon, 21 Sep 2009, Stephen H. Smith wrote:
>
>>
>> Is this going to be practical AT ALL within the existing APRS
>> framework?
>
> Yes. It does take more than one byte per character, if you send
> non-English text, but if that language uses mostly ASCII characters
> (many languages only use a small set of extra letters), the ASCII
> characters are still the same, one byte per character.
>
>> Currently the APRS message string length is limited to 60-70 bytes
>> (i.e. classic APRS ASCII characters). If one starts using UTF-8,
>> you may be using 3-4-5 bytes per character, reducing the effective
>> length of a message to only 15-20 non-western characters. Is this
>> going to be enough to be useful for anything?
>
> With UTF-8, 1 to 3 bytes per character in practice. There's a hard
> limit of 4 bytes per char in UTF-8, but practically everything fits in
> 3. Japanese and Chinese (I suppose) characters take 3 bytes for most
> or all characters, but it isn't so much of a problem, since all common
> words have their own character. Think about it: A single character
> represents a word.
>
>
>
>
Actually, I think Korean may be the most efficient PHONETIC alphabet of
all! As I understand it, each glyph of Korean encodes an entire
phoneme. It may not be quite as compact as Chinese or Japanese, but it
DOES tell you how to pronounce it. [That ought to make text-to-speech
synthesizers really easy!]
>
> In Finnish, we only have a couple of extra letters in use (åäö ÅÄÖ),
> and those characters take 2 bytes in UTF-8. The rest of our letters
> are ASCII, and take 1 byte per character.
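The byte counts quoted above are easy to check. A minimal Python sketch follows; the sample strings and the helper name utf8_len are chosen here purely for illustration and are not part of any APRS software:

```python
# Count UTF-8 bytes per character for a few illustrative samples.
samples = {
    "ASCII": "APRS",
    "Finnish": "åäö",
    "Japanese": "日本語",
}

def utf8_len(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))

for label, text in samples.items():
    per_char = [utf8_len(ch) for ch in text]  # bytes used by each character
    print(label, text, "chars:", len(text), "bytes:", utf8_len(text), per_char)
```

With precomposed letters, the ASCII sample should come out at 1 byte per character, the Finnish letters at 2 and the Japanese characters at 3.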
>
> I thought the 67-byte limit was only present in old hardware which does
> not support UTF-8 or non-ascii characters anyway, and that they could
> be omitted from this discussion.
Again, I think the real problem is what IBM once referred to as the
"tyranny of the installed base" - that existing hardware and software are
going to limit seemingly rational changes for years or decades until
older hardware and software fall out of common use.
In this case, it's the older Kenwood radios with their notoriously
incompetent built-in TNCs.
[A bit of background - the Kenwoods are actually repurposed radios
originally designed for an APRS-like domestic Japanese protocol called
"NaviTra" -- Navigation Transceiver. When they flopped commercially,
Kenwood recycled the hardware design with an implementation of TNC2-like
packet firmware and APRS functions "shoehorned" into the same
underpowered Tasco TNC-on-a-chip custom microcontroller. The Tasco
chip barely has enough buffer space for short APRS packets (100-127
bytes max) and fails dealing with longer "conventional packet"
transmissions. ]
You can see some pictures of the NaviTra predecessors of the D700 here
on my website:
<http://wa8lmf.net/aprs/index.htm>
(Scroll down to about the last 1/4 of the page.)
> Once a hardware manufacturer puts in enough memory to store the
> additional fonts for the new characters (2 megs of flash was Matti's
> guess), they can easily put in a few bytes of memory to support longer
> text messages.
>
> Now that I re-read the APRS specification, it says:
>
> "The message text may be up to 67 characters long, and may contain any
> printable ASCII characters except |, ~ or {."
>
> Please note that it says *characters*, not *bytes*. :)
>
> But OK, I suppose the authors meant bytes when they wrote characters,
> since they limited it to ASCII.
>
> Once the definition is changed to support UTF-8, it can be changed to
> support longer messages.
>
>
Again, I posit that UTF extensions won't be practical until the Kenwoods
fade from common use, so we don't have to worry about their
underpowered TNCs locking up and crashing on strings of 60-70
displayable characters that are actually represented by 150 bytes or
more of actual data sent by newer devices.
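To make the character-versus-byte distinction concrete, here is a rough Python sketch of how a sender might keep a UTF-8 message inside a fixed byte budget without cutting a character in half. It assumes the spec's 67-character limit is read as 67 bytes; the names LEGACY_LIMIT, fits and truncate_utf8 are hypothetical, and nothing here describes how any existing APRS client or TNC behaves:

```python
LEGACY_LIMIT = 67  # assumption: the spec's 67-character limit read as 67 bytes

def fits(text: str, limit: int = LEGACY_LIMIT) -> bool:
    """True if the UTF-8 encoding of text is at most `limit` bytes."""
    return len(text.encode("utf-8")) <= limit

def truncate_utf8(text: str, limit: int = LEGACY_LIMIT) -> str:
    """Cut text to at most `limit` UTF-8 bytes without splitting a character."""
    data = text.encode("utf-8")[:limit]
    # errors="ignore" drops a trailing partial multi-byte sequence, if any.
    return data.decode("utf-8", errors="ignore")

message = "日" * 30                 # 30 characters, 90 bytes in UTF-8
print(fits(message))                # too long for the byte budget
print(len(truncate_utf8(message)))  # characters that survive truncation
```

Thirty 3-byte characters need 90 bytes, so only 22 of them (66 bytes) fit under the assumed 67-byte limit.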
>
>> Consider especially an APRS-to-email message. By the time you
>> express the "name at server.domain" email address, there will
>> be practically NO space for the message.
>
> Wrong. With UTF-8, the first 128 characters are equal to ASCII. Those
> characters in the email address are still 8 bits, 1 byte, per
> character. name at server.domain takes just as many bytes in UTF-8 as in
> ASCII, because they're transmitted exactly the same way. 'name' is
> equal to 'name', and @ still takes 1 byte, because it's encoded the
> same in both ASCII and UTF-8: 64 in decimal, 0x40 in hex, 01000000 in
> bits.
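The claim is simple to verify. In this short Python sketch the address is only a placeholder, written out with a real @ sign for illustration:

```python
address = "name@server.domain"  # placeholder address, for illustration only

ascii_bytes = address.encode("ascii")
utf8_bytes = address.encode("utf-8")

print(ascii_bytes == utf8_bytes)      # same byte sequence in both encodings
print(len(utf8_bytes), len(address))  # one byte per character
print(ord("@"), hex(ord("@")), format(ord("@"), "08b"))  # decimal, hex, bits
```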
I realize that the UTF-coded email address is no longer than in "classic
ASCII" (assuming that the Chinese, Japanese and Israelis don't follow
through with plans to use UTF-8 characters in domain names!) However,
after the initial 10-20 bytes/characters are used up in the address,
the remaining bytes will only support a very few multi-byte-coded
characters, assuming you want to keep the actual data string length
"safe" for older devices like the Kenwoods.
>
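The arithmetic behind the concern above can be sketched in a few lines of Python. The 67-byte budget, the placeholder address and the single separating space are all assumptions made for illustration, and remaining_chars is a hypothetical helper:

```python
def remaining_chars(limit: int, prefix: str, bytes_per_char: int) -> int:
    """How many characters of a given encoded width still fit after `prefix`."""
    used = len(prefix.encode("utf-8"))
    return max(0, limit - used) // bytes_per_char

prefix = "name@server.domain "  # placeholder address plus a separating space
for width in (1, 2, 3):
    # width = UTF-8 bytes per character of the message text
    print(width, remaining_chars(67, prefix, width))
```

With a 19-byte prefix, 48 bytes remain: room for 48 ASCII characters, 24 two-byte characters, or 16 three-byte characters.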
>
>> Any extension that could incorporate UTF-8, longer text strings, more
>> robust encoding and error correction, etc will be so utterly
>> different from the present APRS protocols that I don't think it
>> should be called "APRS" at all.
>
> I do not think so. Adding support for UTF-8 isn't actually very
> difficult, because UTF-8 is designed to be backwards compatible with
> ASCII. In this thread, I'm only arguing about UTF-8, not about any
> major changes like robust encoding or error correction.
>
> Hey, I'm sending this very message from a UTF-8 enabled system, and
> you can still read it! It really is backwards compatible! 日本語
> (Japanese) may not work for you, and I don't understand it either, so
> we don't care about that part. But while I'm sending in UTF-8 and
> you're reading in ASCII, we can still understand each other because
> your ASCII-only system can show my English UTF-8 text.
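A rough way to picture what an ASCII-only reader makes of UTF-8 input, as a Python sketch. The sample text is made up for illustration, and a real mail client or radio display may render the unknown bytes differently (as blanks, question marks or Latin-1 characters):

```python
wire = "Hello from UTF-8: 日本語".encode("utf-8")  # bytes as sent

# An ASCII-only reader: bytes above 0x7F are not valid ASCII, so each one
# is swapped for the replacement character instead of raising an error.
shown = wire.decode("ascii", errors="replace")
print(shown)
```

The English part comes through untouched; only the nine bytes of the three Japanese characters are lost.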
>
> .
Actually I *AM* viewing this in UTF-8 (Thunderbird 2.0.0.23
email client) -- the Japanese (and the multiple languages in the earlier
example) showed up just fine, thanks to having the 24 megabyte "Arial
Unicode" font (which includes 40,000+ glyphs including CJK) set as my
default display font.
But then I am running far more computer "horsepower" (Clevo 470 3GHz
laptop with 17" screen) than I would expect to find embedded in
relatively-low-cost relatively-low-production-volume ham radio
gadgets. I.e. you don't find the processor speeds and RAM/ROM sizes
remotely approaching a Nokia smartphone or iPhone in even the latest ham
gear.
------------------------------------------------------------------------
--
Stephen H. Smith wa8lmf (at) aol.com
EchoLink Node: WA8LMF or 14400 [Think bottom of the 2M band]
Skype: WA8LMF
Home Page: http://wa8lmf.net
NEW! HF APRS Notes & Guide
http://wa8lmf.net/aprs/HF_APRS_Notes.htm
"APRS 101" Explanation of APRS Path Selection & Digipeating
http://wa8lmf.net/DigiPaths
Updated "Rev H" APRS Symbols Set for UI-View, UIpoint and APRSplus:
http://wa8lmf.net/aprs