Fun_People Archive
24 Sep
What is ASCII? Isn't that the same as ISO Latin? Or is that "10646"


Content-Type: text/plain
Mime-Version: 1.0 (NeXT Mail 3.3 v118.2)
From: Peter Langston <psl>
Date: Fri, 24 Sep 99 17:23:25 -0700
To: Fun_People
Precedence: bulk
Subject: What is ASCII?  Isn't that the same as ISO Latin?  Or is that "10646"
	-- Unicode, right?
References: <02d901bf0613$1a3e50a0$d8a4f2d0@lowellt>

X-Lib-of-Cong-ISSN: 1098-7649  -=[ Fun_People ]=-
X-http://www.langston.com/psl-bin/Fun_People.cgi

From: Jon Callas <jon@callas.org>

In our last episode ("Re: [IRR] ASCII.ORG", shown on 9/23/99), Eric A. Hall
said:

>Aren't UTF-8 and UTF-16 the general standards now? I'm not a charset
>expert but that's what I'm (possibly mis-) hearing.


UTF-8 and -16 are encodings, not character sets. They are ways of encoding
multi-byte characters onto a stream of 8-bit or 16-bit units. There are also
UTF-7 and UTF-5, I believe, for even more limited channels.
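
To make the distinction concrete, here is a small Python sketch (an
illustration, not part of the original thread). It takes one character,
looks up its number in the character set, and shows that each encoding
turns that same number into a different byte sequence. The codec names are
those of Python's standard library; the sample character is an arbitrary
choice.

```python
# One character, one number in the character set, several encodings of it.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, chosen arbitrarily

code_point = ord(ch)  # the character's number in the character set

# The same code point serialized by three different encodings.
encoded = {
    "utf-7": ch.encode("utf-7"),
    "utf-8": ch.encode("utf-8"),
    "utf-16-be": ch.encode("utf-16-be"),
}

print(f"U+{code_point:04X}")
for name, data in encoded.items():
    print(f"{name:10} {data.hex()}")
```

The code point never changes; only the bytes on the wire do.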

What they typically encode is Unicode, or ISO 10646. Here's the history:

The first standard character set was ASCII, a.k.a. FIPS 1, as well as being
ANSI and ISO in numbers I can't remember. It's the 7-bit character set that
everyone knows and loves.

ASCII first appeared in 1963. It predates the 8-bit byte as we know it. In
those days, machines were typically word-addressable, and the words were
often 12, 18, 32, or 36 bits long. Characters were packed into
words, and "bytes" were the subdivisions of those words. There were common
byte sizes of 6 to 9 bits. In the standards world today, we still typically
say "octet" instead of "byte" to be clear that we're talking about 8-bit
bytes.

There were character sets that matched all those byte sizes. Let's not go
there, but I could wax poetic for a while.

The first extension to ASCII was ISO Latin, which grew out of the DEC
Multinational character set. It was an eight-bit character set that put
characters needed for West European languages in the top 128 characters.
There were then a number of other ISO Latin character sets that were for
East European, Cyrillic, Greek, Hebrew, Arabic, etc.
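
A short Python sketch of what this means in practice: the same two bytes
are decoded with several ISO Latin (ISO 8859) sets. The byte below 128 is
always the ASCII letter; the byte above 127 depends on which set you
assume. The codec names are the ones in Python's standard library, and the
byte values are arbitrary picks for illustration.

```python
# 0x41 is in the ASCII half; 0xE9 is in the top 128 characters.
raw = bytes([0x41, 0xE9])

codecs_to_try = ("iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-7")

# Decode the identical bytes under each character set.
decoded = {name: raw.decode(name) for name in codecs_to_try}

for name, text in decoded.items():
    print(f"{name:11} {text}  U+{ord(text[1]):04X}")
```

Nothing in the bytes themselves says which set was intended, which is
exactly the problem the later common character sets set out to solve.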

Obviously, 256 characters is not enough, especially when you consider Asian
characters. There have been a number of Japanese sets, for example, with
some 5000 to 15000 characters in them. There have also been 2-byte Korean
sets, and Chinese sets. There was also a set from Taiwan that was a
four-byte set, and horrifically enough, also a 3-byte set for a while.
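
For a feel of the two-byte sets, this Python sketch encodes one character
each with three legacy encodings that ship with Python's standard library
and counts the bytes. The sample characters are arbitrary, and these codecs
are only stand-ins for the families of sets mentioned above; the 3- and
4-byte sets are not shown.

```python
# (codec, sample character) pairs; the samples are arbitrary choices.
samples = [
    ("shift_jis", "\u65e5"),  # a Japanese kanji
    ("euc_kr", "\ud55c"),     # a Korean hangul syllable
    ("big5", "\u4e2d"),       # a traditional Chinese character
]

byte_counts = {}
for codec, ch in samples:
    data = ch.encode(codec)
    byte_counts[codec] = len(data)
    # ASCII letters still take a single byte in each of these encodings.
    ascii_len = len("A".encode(codec))
    print(f"{codec:10} U+{ord(ch):04X} -> {data.hex()} "
          f"({len(data)} bytes; 'A' takes {ascii_len})")
```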

There have been a number of attempts to come up with some common character
set. The two important ones are Unicode and ISO 10646. Unicode is a 2-byte
set that comes out of work at DEC, Xerox, and Apple. ISO 10646 is a
multi-byte set, using up to 4 bytes per character.

Unicode took a smart approach. Just as the ISO Latin series has ASCII as
the bottom 128 characters, Unicode has ISO Latin-1 as the bottom 256
characters. It also has the other ISO Latin sequences in easy-to-translate
places. They have also done an amazing job in the diplomacy department by
getting all of the Asian folks to agree to the principle that if two
characters look alike, they're represented by the same code, even if they
may in some sense be a different character. (Note, this has been true even
in the ASCII days. For example, there's a single character for hyphen and
minus, as well as a single character for the single quote and the
apostrophe.) Ironically, it was hardest to get the Japanese to go
along with this.
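
The "bottom 256" and "easy-to-translate" claims can be checked with a few
lines of Python. This is a sketch using the standard library's codecs; the
choice of the Cyrillic set (ISO 8859-5) and of its letter range 0xB0-0xEF
is mine, picked because the mapping there is a single constant offset.

```python
# ISO Latin-1: every byte value is its own Unicode code point.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("iso-8859-1")
latin1_is_identity = all(ord(c) == b for c, b in zip(latin1_text, all_bytes))

# ISO 8859-5 (Cyrillic): the letters at 0xB0-0xEF sit at a fixed distance
# from their Unicode positions, which start at U+0410.
CYRILLIC_OFFSET = 0x0410 - 0xB0
letters = bytes(range(0xB0, 0xF0))
cyrillic_text = letters.decode("iso-8859-5")
cyrillic_is_shifted = all(
    ord(c) == b + CYRILLIC_OFFSET for c, b in zip(cyrillic_text, letters)
)

print("Latin-1 is the bottom 256:", latin1_is_identity)
print("Cyrillic letters are a constant shift:", cyrillic_is_shifted)
```

So converting Latin-1 to Unicode is a matter of widening each byte, and
converting the Cyrillic letters is one addition per character.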

A number of years ago, there was also another major diplomatic coup by the
Unicode folk by getting the 10646 folk to agree that the 2-byte 10646 would
*be* Unicode. So now, typically, people talk about them as if they're the
same because in some sense they are. But really, Unicode is the low 64K
characters of ISO 10646.

The UTF encodings were first created by the Unicode folk. They are there
just to answer the question: how do you send 16-bit characters in email,
when you can only guarantee a 7-bit stream? Their solution was UTF-7. It
uses 7-bit-clean characters, and sequences multiple characters to get the
higher-numbered characters. UTF-8 is similar for 8-bit streams, and UTF-16
for larger streams.
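
To show how "sequencing multiple characters" works for the 8-bit case,
here is a hand-written sketch of UTF-8 encoding for 16-bit character
values. The function name is hypothetical, and the sketch deliberately
stops at 16 bits to match the discussion above; it does not handle larger
values or reject surrogates.

```python
def utf8_encode_16bit(code_point):
    """Encode one 16-bit character value as 1 to 3 UTF-8 octets."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles 16-bit values")
    if code_point < 0x80:
        # 7 bits: a single octet, identical to ASCII.
        return bytes([code_point])
    if code_point < 0x800:
        # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (code_point >> 12),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for cp in (0x41, 0xE9, 0x65E5):
    print(f"U+{cp:04X} -> {utf8_encode_16bit(cp).hex()}")
```

ASCII text passes through unchanged, and every octet of a multi-octet
sequence has its high bit set, so it can never be mistaken for an ASCII
character.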

 Historically and Pedantically,
 Jon


© 1999 Peter Langston