[aprssig] APRS Message character sets ?

Matti Aarnio oh2mqk at sral.fi
Wed Jul 9 18:37:11 EDT 2008


I am asking the  aprssig  to form a consensus: what should the canonical
message character set and encoding beyond US-ASCII be?


The APRS specification 1.0.1 defines the character set as ASCII in the
many places where it defines character sets at all.

This is quite understandable in view of TNC2-interfaced systems with a
7E1-type communication channel -- which is unable to pass through any
extended characters.

However, I have reason to believe that igate systems with such connections
are a rarity.  There are large parts of the world where ASCII is
insufficient, and thus a plain TNC2 in text monitor mode is not used
(not to mention its tendency to break MIC-e frames as well..)
Instead we use KISS-interfaced modems that are able to pass arbitrary
binary content.
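
For illustration, here is a minimal sketch (Python, not from any particular
igate software) of the KISS byte-stuffing that lets such a link carry
arbitrary binary payloads; only the frame delimiter 0xC0 and the escape
byte 0xDB need special treatment:

   # Minimal sketch of KISS framing (data frame on port 0).
   # Any payload byte, including 0xC0 and 0xDB, survives the link intact.
   FEND, FESC, TFEND, TFESC = 0xC0, 0xDB, 0xDC, 0xDD

   def kiss_frame(payload: bytes) -> bytes:
       out = bytearray([FEND, 0x00])        # FEND + "data, port 0" command byte
       for b in payload:
           if b == FEND:
               out += bytes([FESC, TFEND])  # escape the frame delimiter
           elif b == FESC:
               out += bytes([FESC, TFESC])  # escape the escape byte itself
           else:
               out.append(b)
       out.append(FEND)
       return bytes(out)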


Now, what people are actually using for messaging between each other
apparently includes:

   - US-ASCII
   - PC DOS charset
   - PC Windows charsets (many)
   - Many ISO-8859-*  variations
   - UTF-8
   - UTF-16

UTF-16 is much used by  AGWtracker;  charsets with 8-bit characters
are abundant in other software, depending on what charset the user has
configured the keyboard to use (or what the software vendor has decided
is the current default.)

This babel of charsets makes reading messages with characters outside
US-ASCII somewhat challenging  --  when both users have e.g. UI-View
on Windows, both systems behave the same way -- Windows Latin charsets.
Take a Linux user into the mix, and they do not see A-umlauts
at all, because on Windows the code point is 0x84 (as I recall), while
on ISO-8859-* that code point is "reserved unallocated".
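
As an illustration -- a minimal Python sketch, assuming the sending side
uses the old DOS/OEM codepage CP437 (where the a-umlaut really does sit at
0x84) -- the same byte decodes very differently on the two ends:

   # The byte 0x84 is "a-umlaut" in the DOS/OEM codepage CP437,
   # but an invisible C1 control position in ISO-8859-1.
   b = bytes([0x84])
   print(b.decode('cp437'))             # -> 'ä'
   print(repr(b.decode('iso-8859-1')))  # -> '\x84', nothing visible to show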

Indeed, there are about a dozen ISO 8859 charsets, all for widely
separate languages with widely separate glyph sets.
They all have one thing in common, though:  US-ASCII forms code points
0 to 127 in every one of them.
Some examples of these on code-point 0xAF:
  iso-8859-1:   MACRON
  iso-8859-2:   LATIN CAPITAL LETTER Z WITH DOT ABOVE
  iso-8859-7:   HORIZONTAL BAR
  iso-8859-9:   MACRON

  KOI8-R:       FORMS DOUBLE UP AND LEFT   (Cyrillic)  

etc.  (Pick Windows codepages, and you will have a really merry mess;
a small sketch for checking these follows below...)
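
The list above is easy to reproduce (and extend) with a small Python
sketch that decodes the single byte 0xAF under a handful of charsets and
prints the Unicode name of whatever comes out:

   import unicodedata

   # The same byte, 0xAF, means something different in nearly every charset.
   for cs in ('iso-8859-1', 'iso-8859-2', 'iso-8859-7', 'iso-8859-9',
              'koi8_r', 'cp437', 'cp1252'):
       ch = bytes([0xAF]).decode(cs)
       print(f"{cs:11s} 0xAF -> {unicodedata.name(ch, '<unnamed>')}")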


On the wire, UTF-8 is compatible with US-ASCII, which forms its subset on
code points 0 through 127.  On code points 128-255 (and beyond)
the differences appear.
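
A quick Python sketch to show just that: plain ASCII text is byte-for-byte
identical under UTF-8, while anything above 127 becomes a multi-byte
sequence (and turns into the familiar mojibake if somebody reads it as
Latin-1):

   # ASCII survives UTF-8 unchanged; non-ASCII grows to two or more bytes.
   print('message'.encode('utf-8'))              # b'message'  -- identical to US-ASCII
   print('ä'.encode('utf-8'))                    # b'\xc3\xa4' -- two bytes for one character
   print('ä'.encode('utf-8').decode('latin-1'))  # 'Ã¤' -- what a non-UTF-8 reader sees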


There is lots and lots of old software running,  but PC software
is simpler to update than hardware  --  we are still making new hardware
with ancient Bell-202 modems on it just because existing systems
use that modulation.


For example, UTF-16 has the following nasty habit when encoding ASCII
text:

   Every character is represented with two bytes.

   If U < 0x10000, encode U as a 16-bit unsigned integer and terminate.
   (If U is bigger, the encoding calls for two 16-bit integers.)

   The canonical wire-order presentation of the 16-bit unsigned integer is
   to put the high byte first, then the low byte.

   (About UTF-16, see:  http://www.ietf.org/rfc/rfc2781.txt  )


Encoding the US-ASCII text  "message" results in the byte sequence:

   "\000m\000e\000s\000s\000a\000g\000e"

that is, every second byte is NUL,  which in carelessly written C programs
means that the string ends at the first NUL byte - before the "m" character.
What is more annoying is the space wastage.   For west European languages,
where most of the alphabet used is 'A' through 'Z' and all the fancy
things like A-umlauts are in minority use, UTF-16 is a waste -- not to
mention the NUL-byte hazard..  For languages where ASCII characters are
rare, like Cyrillic, Greek, etc., things are different.
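
The same in a Python sketch, showing both the embedded NUL bytes and the
doubling in size, with UTF-8 for contrast:

   text = 'message'
   u16 = text.encode('utf-16-be')   # big-endian, no BOM, per RFC 2781 wire order
   print(u16)                       # b'\x00m\x00e\x00s\x00s\x00a\x00g\x00e'
   print(len(text), len(u16))       # 7 vs 14 bytes -- every ASCII char doubles
   # A C-style string reader stops at the first NUL, i.e. before the 'm':
   print(u16.split(b'\x00', 1)[0])  # b''
   print(text.encode('utf-8'))      # b'message' -- same 7 bytes as plain ASCII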


In my opinion, we are already long overdue in specifying the correct way
to extend the character set used beyond ASCII.   Ad-hoc de-facto
extensions are already causing interoperability problems.



73 de Matti, OH2MQK



