[aprssig] APRS Message character sets ?

Thu Jul 10 07:05:54 EDT 2008

On Wed, Jul 09, 2008 at 06:41:35PM -0700, Stephen H. Smith wrote:
> From: "Stephen H. Smith" <wa8lmf2 at aol.com>
> 
> Matti Aarnio wrote:
> > I am asking the  aprssig  to form a consensus of what shall be the canonic
> > message character-set and encoding beyond US-ASCII ?
> >
> >
> > The APRS specification 1.0.1 defines that used character set is ASCII in
> > many places where it defines charactersets at all.
> >
> > This is quite understandable in view of TNC2 interfaced systems with
> > 7E1 type communication channel -- which is unable to pass thru any extended
> > characters.
> 
> 
> 1)    TNC2-type systems are normally initialized to 8-N-1 (not the 
>       default 7-E-1) for APRS with the  TNC commands  ...
> (or equivalent for other TNCs) so that they WILL be transparent to 8-bit 
> characters.  This is required so that any arbitrary character values 
> created by Mic-E encoding will pass.    In fact I have modified the TNC2 
> firmware Ver 1.1.9 to DEFAULT to 8-N-1 bit transparent mode so no init 
> is even required.

My point is rather that even an OLD TNC2 will work when commanded into KISS
mode.  I have enough experience with "sanitized 8N1 text monitor" modes to
know that they are a very bad idea  -->  use KISS mode instead.
(And running a bi-directional igate over such a monitor mode is fraught
with unreliability.)

> 2)    Since the APRS network infrastructure is heavily based on legacy 
> 1980s-1990s packet hardware that doesn't support 16-bit/character 
> encoding (i.e. UniCode), I wouldn't hold my breath for 16-bit support 
> any time soon. 

You are confusing user interfaces and network infrastructures.
You are also confusing character set and its encoding.

The network infrastructure just handles sequences of 8-bit bytes that are
presentable as text lines, meaning that the byte pair 0x0d 0x0a marks the
end of a line.
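
As a rough illustration of that view (a Python sketch; the function name
split_lines and the sample packets are made up for the example and are not
taken from any APRS software), splitting a received buffer on 0x0d 0x0a
needs no knowledge of the character set at all:

```python
def split_lines(buffer: bytes) -> tuple[list[bytes], bytes]:
    """Split on CR LF; return (complete lines, unterminated remainder)."""
    *lines, rest = buffer.split(b"\r\n")
    return lines, rest


# Hypothetical sample data: the second line carries bytes outside US-ASCII.
sample = b"N0CALL>APRS:hello\r\nN0CALL>APRS:p\xe4iv\xe4\xe4\r\nN0CA"
lines, rest = split_lines(sample)
print(lines)   # the two complete lines, passed through as raw bytes
print(rest)    # the partial line, kept until more bytes arrive
```

The 8-bit payload bytes pass through untouched; only the line terminator is
interpreted.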

Even UTF-16 characters are just pairs of such 8-bit bytes, which together
are taken as the encoding of a single character in the Unicode code space.
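
To make the character set versus encoding distinction concrete, here is a
small Python sketch.  The Euro sign is picked only as an example character;
the codec names are those of Python's standard library.

```python
ch = "\u20ac"                       # EURO SIGN, chosen only as an example
code_point = ord(ch)                # its position in the Unicode character set
utf16 = ch.encode("utf-16-be")      # one 16-bit unit, i.e. a pair of bytes
utf8 = ch.encode("utf-8")           # the same character as three bytes
latin9 = ch.encode("iso8859_15")    # the same character as a single byte

print(hex(code_point), utf16.hex(), utf8.hex(), latin9.hex())
# prints: 0x20ac 20ac e282ac a4
```

One character, three different byte sequences on the wire: the character
set says which character is meant, the encoding says which bytes carry it.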

And by the way, people in the US have traditionally ignored the issue of
international character sets completely.  When one does not need characters
outside US-ASCII, all is fine and dandy with limiting everything to US-ASCII.

Europe is different, Asia even more so.

We Finns actively need 4 characters outside US-ASCII, and a few more
infrequently.  We can get by with ISO-8859-15 (which includes the Euro
currency symbol.)

       ISO 8859-1    West European languages (Latin-1)
       ISO 8859-2    Central and East European languages (Latin-2)
       ISO 8859-3    Southeast European and miscellaneous languages (Latin-3)
       ISO 8859-4    Scandinavian/Baltic languages (Latin-4)
       ISO 8859-5    Latin/Cyrillic
       ISO 8859-6    Latin/Arabic
       ISO 8859-7    Latin/Greek
       ISO 8859-8    Latin/Hebrew
       ISO 8859-9    Latin-1 modification for Turkish (Latin-5)
       ISO 8859-10   Lappish/Nordic/Eskimo languages (Latin-6)
       ISO 8859-11   Latin/Thai
       ISO 8859-13   Baltic Rim languages (Latin-7)
       ISO 8859-14   Celtic (Latin-8)
       ISO 8859-15   West European languages (Latin-9)
       ISO 8859-16   Romanian (Latin-10)

All are sufficient for their respective regions, but when you do not carry
information about which character set is in use, Greek text appears on our
screens as a bunch of accented Latin letters.
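
A short Python sketch of that effect, assuming the sender uses ISO 8859-7
and the receiver assumes ISO 8859-15 (the three-letter sample text is
invented for the example):

```python
greek = "\u03b1\u03b2\u03b3"           # alpha, beta, gamma
wire = greek.encode("iso8859_7")       # what the Greek sender transmits
shown = wire.decode("iso8859_15")      # what a Latin-9 receiver displays

print(wire.hex())   # prints: e1e2e3
print(shown)        # prints: áâã
```

The bytes arrive intact; only the receiver's guess about the character set
is wrong.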

>    Not to mention that the most widely used APRS 
> application (UI-View) is now frozen in time and unchangeable.    

I do think that UI-View is dead software that should be deprecated, and
as soon as somebody makes modern software of similar quality THAT GETS
SUFFICIENT PROMOTION, it will be irrelevant as far as message encodings go.
(No, I do not write Windows software.)

The real problem is that there is no specification of what to do when one
needs to go beyond US-ASCII, just dozen(s) of ad-hoc solutions, all
incompatible:
  - sending PC DOS characters
  - sending Windows characters
  - sending ISO-8859-XX characters 
  - sending KOI8-R encoding characters
  - sending Unicode in UTF-8 sequences
  - sending Unicode in UTF-16 sequences
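
How incompatible these are shows up with a single letter.  The Python
sketch below encodes "ä" with stand-ins for each item of the list; the
particular codecs (cp437 for "PC DOS", cp1252 for "Windows", and so on) are
my assumption of typical choices, not something any specification names.

```python
# Assumed representatives of the ad-hoc solutions listed above.
CODECS = ["cp437", "cp1252", "iso8859_15", "koi8_r", "utf-8", "utf-16-le"]


def encode_all(ch: str) -> dict:
    """Encode ch with every codec; None marks 'cannot be represented'."""
    result = {}
    for name in CODECS:
        try:
            result[name] = ch.encode(name)
        except UnicodeEncodeError:
            result[name] = None
    return result


table = encode_all("\u00e4")            # LATIN SMALL LETTER A WITH DIAERESIS
for name, raw in table.items():
    print(f"{name:12} {raw.hex() if raw is not None else 'not representable'}")
```

The same letter comes out as 0x84, 0xe4, 0xc3 0xa4 or 0xe4 0x00 depending
on the sender, and KOI8-R cannot carry it at all.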

Most surprises are caused by UTF-16 encoders sending US-ASCII text:
every second byte has the value 0x00.  That particular byte value causes
tons of problems for C programmers, who treat it as the end-of-string token
(when coding without explicit string length info.)
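
The zero-byte problem can be simulated in a few lines of Python.  The
helper c_string_view is a made-up name; it only mimics what a
NUL-terminated C string routine would see and is not real C behaviour
under test.

```python
def c_string_view(data: bytes) -> bytes:
    """Return the part of data that precedes the first 0x00 byte."""
    return data.split(b"\x00", 1)[0]


wire = "Hello".encode("utf-16-le")     # US-ASCII text sent as UTF-16

print(wire)                            # prints: b'H\x00e\x00l\x00l\x00o\x00'
print(c_string_view(wire))             # prints: b'H'
print(c_string_view("Hello".encode("utf-8")))   # prints: b'Hello'
```

With UTF-16 the C-style reader sees a one-character message; with UTF-8
the same text survives intact.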

People being people, they WILL use APRS messages to send native-language
text, and that will often mean going outside US-ASCII.

For "message" messages I would prefer some updated standard, preferably an
ASCII-compatible one, which in practice means UTF-8.   Other types of
messages have comment fields, which I have seen(*) carrying characters
outside ASCII -- and those pesky zero bytes.

*) Seen in APRS-IS traffic dumps written to capture _all_ of the traffic.
   Several such cases occur around the network in any given hour.
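
What "ASCII compatible" buys in practice can be sketched in Python.  The
function decode_message and its ISO-8859-15 fallback are my own
illustration of how a receiver might cope during a transition, not part of
any APRS specification.

```python
def decode_message(raw: bytes, fallback: str = "iso8859_15") -> str:
    """Try UTF-8 first; fall back to an assumed legacy 8-bit charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


# Pure US-ASCII text is byte-for-byte identical in UTF-8.
assert "CQ CQ".encode("utf-8") == "CQ CQ".encode("ascii")

print(decode_message(b"CQ CQ"))                         # old ASCII sender
print(decode_message("p\u00e4iv\u00e4\u00e4".encode("utf-8")))   # UTF-8 sender
print(decode_message(b"p\xe4iv\xe4\xe4"))               # legacy Latin-9 sender
```

Existing ASCII-only stations need no change, and legacy 8-bit text is
almost never valid UTF-8, so the fallback guess rarely triggers by mistake.
The fallback charset itself remains a guess, which is exactly the problem a
specification would remove.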

> --
> Stephen H. Smith    wa8lmf (at) aol.com
> Home Page:          http://wa8lmf.com  --OR--   http://wa8lmf.net

73 de Matti, OH2MQK