Universal Character Set - History of ISO 10646

The International Organization for Standardization (ISO) began work on a universal character set in 1989 and published the draft of ISO 10646 in 1990. Hugh McGregor Ross was one of its principal architects. That draft differed markedly from the current standard. It defined:

128 groups of
256 planes of
256 rows of
256 cells,

for an apparent total of 2,147,483,648 characters. In practice the standard could encode only 679,477,248 characters, because its policy forbade the byte values of the C0 and C1 control codes (0x00 to 0x1F and 0x80 to 0x9F in hexadecimal notation) in any of the four bytes specifying a group, plane, row and cell. The Latin capital letter A, for example, was located at group 0x20, plane 0x20, row 0x20, cell 0x41.
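The two totals above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the standard: the helper names `is_allowed` and `pack` are hypothetical, and packing the four bytes into one 32-bit integer is only an illustration of how a four-byte code could be written down.

```python
def is_allowed(byte: int) -> bool:
    """True if the byte value is neither a C0 (0x00-0x1F) nor a C1 (0x80-0x9F) code."""
    return not (0x00 <= byte <= 0x1F or 0x80 <= byte <= 0x9F)

# Plane, row and cell each range over all 256 byte values; the group over only 128.
allowed_bytes = [b for b in range(256) if is_allowed(b)]   # 192 values
allowed_groups = [b for b in range(128) if is_allowed(b)]  # 96 values

apparent_total = 128 * 256 * 256 * 256
usable_total = len(allowed_groups) * len(allowed_bytes) ** 3

def pack(group: int, plane: int, row: int, cell: int) -> int:
    """Combine the four bytes into one integer (hypothetical helper for illustration)."""
    for b in (group, plane, row, cell):
        if not is_allowed(b):
            raise ValueError(f"byte 0x{b:02X} is a control code value")
    return (group << 24) | (plane << 16) | (row << 8) | cell

latin_capital_a = pack(0x20, 0x20, 0x20, 0x41)

print(apparent_total)        # 2147483648
print(usable_total)          # 679477248
print(hex(latin_capital_a))  # 0x20202041
```

The usable total is 96 × 192 × 192 × 192: each of the lower three bytes loses 64 of its 256 values, and the group byte loses the 32 C0 values out of its 128.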

Characters in this original draft of ISO 10646 could be encoded in one of three ways:

UCS-4, four bytes for every character, enabling the simple encoding of all characters;
UCS-2, two bytes for every character, enabling the straightforward encoding of the first plane, 0x20, the Basic Multilingual Plane, which contained the first 36,864 code points; other planes and groups were reached by switching to them with ISO 2022 escape sequences;
UTF-1, which encoded all the characters in byte sequences of varying length (1 to 5 bytes, none of which was a control code value).

In 1990, therefore, two initiatives for a universal character set existed: Unicode, with 16 bits for every character (65,536 possible characters), and ISO 10646. The software companies refused to accept the complexity and size requirements of the ISO standard and were able to convince a number of ISO National Bodies to vote against it. The ISO standardizers realized they could not continue to support the standard in its existing form and negotiated the unification of their standard with Unicode. Two changes took place: the restriction on byte values (the prohibition of control code values) was lifted, opening code points like 0x0000101F for allocation; and the repertoire of the Basic Multilingual Plane was synchronized with that of Unicode.

Meanwhile, the situation changed in the Unicode standard itself: 65,536 characters came to appear insufficient, and from version 2.0 onwards the standard supports the encoding of 1,112,064 code points from 17 planes by means of the UTF-16 surrogate mechanism. For that reason, ISO 10646 was limited to contain only as many characters as UTF-16 can encode, that is, a little over a million characters instead of over 679 million. The UCS-4 encoding of ISO 10646 was incorporated into the Unicode standard, restricted to the UTF-16 range, under the name UTF-32, although it sees almost no use outside programs' internal data.
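The figure of 1,112,064 follows from 17 planes of 65,536 code points each, minus the 2,048 values (0xD800 to 0xDFFF) that UTF-16 reserves as surrogates. The sketch below reproduces that count and shows how the surrogate mechanism splits a code point beyond the first plane into two 16-bit units. The function name `to_utf16_units` is hypothetical, chosen for illustration.

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000        # 65,536
SURROGATE_COUNT = 0xDFFF - 0xD800 + 1  # 2,048 values reserved for surrogates

encodable = PLANES * CODE_POINTS_PER_PLANE - SURROGATE_COUNT

def to_utf16_units(code_point: int) -> list:
    """Return the 16-bit code units that represent code_point in UTF-16."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if code_point < 0x10000:
        return [code_point]             # first plane: a single unit
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return [high, low]

print(encodable)                                      # 1112064
print([hex(u) for u in to_utf16_units(0x41)])         # ['0x41']
print([hex(u) for u in to_utf16_units(0x1F600)])      # ['0xd83d', '0xde00']
```

Two surrogates carry 10 bits each, so a pair addresses 2^20 = 1,048,576 code points, which is exactly the 16 planes above the first one.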

Rob Pike and Ken Thompson, the designers of the Plan 9 operating system, devised a new, fast and well-designed mixed-width encoding, which came to be called UTF-8 and is now the most widely used UCS encoding.
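As a small illustration of what "mixed width" means in practice, the sketch below uses Python's built-in UTF-8 codec to show how many bytes a few sample characters occupy. The choice of sample characters is arbitrary.

```python
# Sample characters, written as escapes: A, e with acute, euro sign, grinning face.
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies when encoded as UTF-8.
widths = [len(ch.encode("utf-8")) for ch in samples]

for ch, width in zip(samples, widths):
    encoded = ch.encode("utf-8").hex(" ")
    print(f"U+{ord(ch):04X} -> {width} byte(s): {encoded}")
```

ASCII characters keep their single-byte form, while characters further up the code space take two, three or four bytes.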
