Universal Character Set - Encoding Forms of The Universal Character Set

Encoding Forms of The Universal Character Set

ISO 10646 defines several character encoding forms for the Universal Character Set. The simplest, UCS-2, uses a single code value (defined as one or more numbers representing a code point) between 0 and 65,535 for each character, and allows exactly two bytes (one 16-bit word) to represent that value. UCS-2 thereby permits a binary representation of every code point in the BMP, as long as the code point represents a character. UCS-2 cannot represent code points outside the BMP. Occasionally, articles about Unicode will mistakenly refer to UCS-2 as "UCS-16". UCS-16 does not exist; the authors who make this error usually intend to refer to UCS-2 or to UTF-16.

The first amendment to the original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP. A range of code points in the S (Special) Zone of the BMP remains unassigned to characters. UCS-2 disallows use of code values for these code points, but UTF-16 allows their use in pairs. Unicode also adopted UTF-16, but in Unicode terminology, the high-half zone elements become "high surrogates" and the low-half zone elements become "low surrogates".

Another encoding, UCS-4, uses a single code value between 0 and (theoretically) hexadecimal 7FFFFFFF for each character (although the UCS stops at 10FFFF and ISO/IEC 10646 has stated that all future assignments of characters will also take place in that range). UCS-4 allows representation of each value as exactly four bytes (one 32-bit word). UCS-4 thereby permits a binary representation of every code point in the UCS, including those outside the BMP. As in UCS-2, every encoded character has a fixed length in bytes, which makes it simple to manipulate, but of course it requires twice as much storage as UCS-2.

Currently, the dominant UCS encoding is UTF-8, which is a variable-width encoding designed for backward compatibility with ASCII, and for avoiding the complications of endianness and byte-order marks in UTF-16 and UTF-32. More than half of all Web pages are encoded in UTF-8. The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8. The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and create mail using UTF-8. It is also increasingly being used as the default character encoding in operating systems, programming languages, APIs, and software applications.

See also Comparison of Unicode encodings.

Read more about this topic:  Universal Character Set

Famous quotes containing the words forms, universal, character and/or set:

    The analogy between the mind and a computer fails for many reasons. The brain is constructed by principles that assure diversity and degeneracy. Unlike a computer, it has no replicative memory. It is historical and value driven. It forms categories by internal criteria and by constraints acting at many scales, not by means of a syntactically constructed program. The world with which the brain interacts is not unequivocally made up of classical categories.
    Gerald M. Edelman (b. 1928)

    Not so many years ago there there was no simpler or more intelligible notion than that of going on a journey. Travel—movement through space—provided the universal metaphor for change.... One of the subtle confusions—perhaps one of the secret terrors—of modern life is that we have lost this refuge. No longer do we move through space as we once did.
    Daniel J. Boorstin (b. 1914)

    But the wise know that foolish legislation is a rope of sand, which perishes in the twisting; that the State must follow, and not lead the character and progress of the citizen; the strongest usurper is quickly got rid of; and they only who build on Ideas, build for eternity; and that the form of government which prevails, is the expression of what cultivation exists in the population which permits it.
    Ralph Waldo Emerson (1803–1882)

    To Time it never seems that he is brave
    To set himself against the peaks of snow
    To lay them level with the running wave,
    Nor is he overjoyed when they lie low,
    But only grave, contemplative and grave.
    Robert Frost (1874–1963)