[aprssig] Please, standardize UTF-8 for APRS

Stephen H. Smith wa8lmf2 at aol.com
Mon Sep 21 17:41:15 EDT 2009


Heikki Hannikainen wrote:
> On Mon, 21 Sep 2009, Stephen H. Smith wrote:
>
>>
>> Is this going to be practical AT ALL  within the existing APRS 
>> framework?
>
> Yes. It does take more than one byte per character, if you send 
> non-English text, but if that language uses mostly ASCII characters 
> (many languages only use a small set of extra letters), the ASCII 
> characters are still the same, one byte per character.
>
>> Currently the APRS message string length is limited to 60-70 bytes 
>> (i.e. classic APRS ASCII characters).   If one starts using UTF-8, 
>> you may be using 3-4-5 bytes per character, reducing the effective 
>> length of a message to only 15-20 non-western characters.  Is this 
>> going to be enough to be useful for anything?
>
> With UTF-8, 1 to 3 bytes per character in practice. There's a hard 
> limit of 4 bytes per char in UTF-8, but practically everything fits in 
> 3. Japanese and Chinese (I suppose) characters take 3 bytes for most 
> or all characters, but it isn't so much of a problem, since all common 
> words have their own character. Think about it: A single character 
> represents a word.
>
>
>
>
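
The byte counts quoted above can be checked with a small sketch. The sample characters and the helper name `utf8_len` are illustrative choices, not anything taken from the APRS specification.

```python
# Count how many bytes each character occupies once encoded as UTF-8.
# The samples below are illustrative picks, not part of any APRS spec.
samples = {
    "ASCII letter": "A",
    "Finnish letter": "\u00e4",         # a with diaeresis
    "Japanese kanji": "\u65e5",         # first character of the word for "Japanese"
    "Emoji (outside the BMP)": "\U0001F600",
}


def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies in UTF-8."""
    return len(text.encode("utf-8"))


for label, ch in samples.items():
    print(f"{label}: {ch!r} -> {utf8_len(ch)} byte(s)")

# A mostly-ASCII Finnish word only pays extra for its accented letters.
word = "p\u00e4iv\u00e4\u00e4"  # 6 characters, 3 of them non-ASCII
print(len(word), "characters,", utf8_len(word), "bytes")
```

Under these assumptions the ASCII letter costs 1 byte, the Finnish letter 2, the kanji 3, and only the character outside the Basic Multilingual Plane reaches the 4-byte ceiling.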



Actually, I think Korean may be the most efficient PHONETIC alphabet of 
all! As I understand it, each glyph of Korean encodes an entire 
syllable.  It may not be quite as compact as Chinese or Japanese, but it 
DOES tell you how to pronounce it.  [That ought to make text-to-speech 
synthesizers really easy!]



>
> In Finnish, we only have a couple of extra letters in use (åäö ÅÄÖ), 
> and those characters take 2 bytes in UTF-8. The rest of our letters 
> are ASCII, and take 1 byte per character.
>
> I thought the 67-byte limit was only present in old hardware which does 
> not support UTF-8 or non-ascii characters anyway, and that they could 
> be omitted from this discussion. 

Again, I think the real problem is what IBM once referred to as the 
"tyranny of the installed base" - that existing hardware and software are 
going to limit seemingly rational changes for years or decades until 
older hardware and software fall out of common use.


In this case, it's the older Kenwood radios with their notoriously 
incompetent built-in TNCs. 

[A bit of background - the Kenwoods are actually repurposed radios 
originally designed for an APRS-like domestic Japanese protocol called 
"NaviTra" -- Navigation Transceiver.  When they flopped commercially, 
Kenwood recycled the hardware design with an implementation of TNC2-like 
packet firmware and APRS functions "shoehorned" into the same 
underpowered Tasco TNC-on-a-chip custom microcontroller.  The Tasco 
chip barely has enough buffer space for short APRS packets (100-127 
bytes max) and fails when dealing with longer "conventional packet" 
transmissions.]


You can see some pictures of the NaviTra predecessors of the D700 here 
on my website:
     <http://wa8lmf.net/aprs/index.htm>   


(Scroll down to about the last 1/4 of the page.)




> Once a hardware manufacturer puts in enough memory to store the 
> additional fonts for the new characters (2 megs of flash was Matti's 
> guess), they can easily put in a few bytes of memory to support longer 
> text messages.
>
> Now that I re-read the APRS specification, it says:
>
> "The message text may be up to 67 characters long, and may contain any 
> printable ASCII characters except |, ~ or {."
>
> Please note that it says *characters*, not *bytes*. :)
>
> But OK, I suppose the authors meant bytes when they wrote characters, 
> since they limited it to ASCII.
>
> Once the definition is changed to support UTF-8, it can be changed to 
> support longer messages.
>
>

Again, I posit that UTF-8 extensions won't be practical until the Kenwoods 
fade from common use, so we don't have to worry about their 
underpowered TNCs locking up and crashing on strings of 60-70 
displayable characters that are actually represented by 150 bytes or 
more of actual data sent by newer devices.   
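
One way a newer device could stay "safe" for such TNCs is to budget in bytes rather than characters. The following is only a sketch: it assumes the spec's 67-"character" limit is treated as a 67-byte budget, and the names `MAX_BYTES`, `fits` and `truncate_utf8` are hypothetical, not from any APRS software.

```python
MAX_BYTES = 67  # assumption: read the 67-"character" limit as a byte budget


def fits(text: str, limit: int = MAX_BYTES) -> bool:
    """True if the UTF-8 encoding of `text` is within the byte budget."""
    return len(text.encode("utf-8")) <= limit


def truncate_utf8(text: str, limit: int = MAX_BYTES) -> str:
    """Cut `text` to at most `limit` UTF-8 bytes without splitting a character."""
    encoded = text.encode("utf-8")[:limit]
    # errors="ignore" drops a trailing partial multi-byte sequence, if any.
    return encoded.decode("utf-8", errors="ignore")


message = "\u65e5" * 30            # 30 three-byte characters = 90 bytes
print(fits(message))               # too long for the byte budget
safe = truncate_utf8(message)
print(len(safe), len(safe.encode("utf-8")))
```

With 3-byte characters the 67-byte budget holds 22 characters (66 bytes); the leftover byte cannot hold another character and is dropped.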

>
>>      Consider especially an APRS-to-email message.   By the time you 
>> express the "name at server.domain" email address, there will be 
>> practically NO space for the message.
>
> Wrong. With UTF-8, the first 128 characters are equal to ASCII. Those 
> characters in the email address are still 8 bits, 1 byte, per 
> character. name at server.domain takes just as many bytes in UTF-8 as in 
> ASCII, because they're transmitted exactly the same way. 'name' is 
> equal to 'name', and @ still takes 1 byte, because it's encoded the 
> same in both ASCII and UTF-8: 64 in decimal, 0x40 in hex, 01000000 in 
> bits.
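
The claim in the quoted paragraph is easy to verify. A minimal sketch, using `name@server.domain` purely as a placeholder address:

```python
address = "name@server.domain"  # placeholder address, not a real mailbox

ascii_bytes = address.encode("ascii")
utf8_bytes = address.encode("utf-8")

# Pure-ASCII text produces the identical byte sequence in both encodings.
print(ascii_bytes == utf8_bytes)

# One byte per character, so byte length equals character count.
print(len(utf8_bytes), len(address))

# The '@' sign: decimal, hex and binary forms of its single byte.
at_byte = "@".encode("utf-8")[0]
print(at_byte, hex(at_byte), format(at_byte, "08b"))
```

The last line prints 64, 0x40 and 01000000, matching the values given above.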




I realize that the UTF-8-coded email address is no longer than in "classic 
ASCII" (assuming that the Chinese, Japanese and Israelis don't follow 
through with plans to use UTF-8 characters in domain names!)  However, 
after the initial 10-20 bytes/characters are used up in the address, 
the remaining bytes will only support a very few multi-byte-coded 
characters, assuming you want to keep the actual data string length 
"safe" for older devices like the Kenwoods.



>
>
>> Any extension that could incorporate UTF-8, longer text strings, more 
>> robust encoding and error correction, etc will be so utterly 
>> different from the present APRS protocols that I don't think it 
>> should be called "APRS" at all.
>
> I do not think so. Adding support for UTF-8 isn't actually very 
> difficult, because UTF-8 is designed to be backwards compatible with 
> ASCII. In this thread, I'm only arguing about UTF-8, not about any 
> major changes like robust encoding or error correction.
>
> Hey, I'm sending this very message from a UTF-8 enabled system, and 
> you can still read it! It really is backwards compatible! 日本語 
> (Japanese) may not work for you, and I don't understand it either, so 
> we don't care about that part. But while I'm sending in UTF-8 and 
> you're reading in ASCII, we can still understand each other because 
> your ASCII-only system can show my English UTF-8 text.
>
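
What an ASCII-only reader would see can be approximated as follows. This is a sketch of one possible behaviour (substituting a replacement mark for each unknown byte); real devices may show garbage or blanks instead, and `view_as_ascii` is a hypothetical helper name.

```python
def view_as_ascii(data: bytes) -> str:
    """Decode bytes the way a strict ASCII-only viewer might, marking
    every byte it does not understand with U+FFFD."""
    return data.decode("ascii", errors="replace")


sent = "Hello from Finland, \u65e5\u672c\u8a9e".encode("utf-8")
seen = view_as_ascii(sent)

# The English part survives untouched; only the non-ASCII bytes are lost.
print(seen)
print(seen.count("\ufffd"), "bytes could not be displayed")
```

The three Japanese characters occupy nine bytes, so nine replacement marks follow the intact English text.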

Actually I *AM* viewing this in UTF-8 (Thunderbird 2.0.0.23 
email client) -- the Japanese (and the multiple languages in the earlier 
example) showed up just fine, thanks to having the 24 megabyte "Arial 
Unicode" font (which includes 40,000+ glyphs including CJK) set as my 
default display font.


But then I am running far more computer "horsepower" (Clevo 470 3GHz 
laptop with 17" screen) than I would expect embedded in 
relatively-low-cost, relatively-low-production-volume ham radio 
gadgets.  I.e. you don't find processor speeds and RAM/ROM sizes 
remotely approaching a Nokia smartphone or iPhone in even the latest ham 
gear.


------------------------------------------------------------------------

--

Stephen H. Smith    wa8lmf (at) aol.com
EchoLink Node:      WA8LMF  or 14400    [Think bottom of the 2M band]
Skype:        WA8LMF
Home Page:          http://wa8lmf.net

NEW!  HF APRS Notes & Guide
  http://wa8lmf.net/aprs/HF_APRS_Notes.htm

"APRS 101"  Explanation of APRS Path Selection & Digipeating
  http://wa8lmf.net/DigiPaths

Updated "Rev H" APRS Symbols Set for UI-View, UIpoint and APRSplus:
  http://wa8lmf.net/aprs
