Encoding Forms of The Universal Character Set

ISO/IEC 10646 defines several character encoding forms for the Universal Character Set. The simplest, UCS-2, uses a single code value (a number, or sequence of numbers, that represents a code point) between 0 and 65,535 for each character, and stores that value in exactly two bytes (one 16-bit word). UCS-2 can therefore represent every code point in the Basic Multilingual Plane (BMP) that is assigned to a character, but it cannot represent code points outside the BMP. Articles about Unicode occasionally refer to "UCS-16". No such encoding exists; the authors who make this error usually mean UCS-2 or UTF-16.
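The two-bytes-per-character rule can be illustrated with a short Python sketch. The function name `encode_ucs2_be` and the choice of big-endian byte order are assumptions made for this illustration; they are not taken from the standard's text.

```python
import struct


def encode_ucs2_be(text: str) -> bytes:
    """Encode text as big-endian UCS-2 (hypothetical helper for illustration).

    Every character becomes exactly one 16-bit code value, i.e. two bytes.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Code points beyond the BMP do not fit in a single 16-bit value.
            raise ValueError(f"U+{cp:04X} lies outside the BMP")
        if 0xD800 <= cp <= 0xDFFF:
            # These BMP code points are not assigned to characters.
            raise ValueError(f"U+{cp:04X} is not a character code point")
        out += struct.pack(">H", cp)  # one unsigned 16-bit word, big-endian
    return bytes(out)


if __name__ == "__main__":
    data = encode_ucs2_be("A\u20ac")  # 'A' and the euro sign, both in the BMP
    print(data.hex(), len(data))
```

Because every character occupies the same number of bytes, the encoded length is always twice the number of characters, and any character outside the BMP is rejected rather than encoded.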

The first amendment to the original edition of the UCS defined UTF-16, an extension of UCS-2 that can represent code points outside the BMP. A range of code points in the S (Special) Zone of the BMP, U+D800 through U+DFFF, remains unassigned to characters. UCS-2 disallows code values for these code points, but UTF-16 allows them in pairs: one value from the high half of the range (U+D800 to U+DBFF) followed by one from the low half (U+DC00 to U+DFFF). Unicode also adopted UTF-16, but in Unicode terminology the high-half zone elements are called "high surrogates" and the low-half zone elements are called "low surrogates".
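The pairing arithmetic used by UTF-16 can be sketched as follows. The helper names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical; the ranges D800 to DBFF (high) and DC00 to DFFF (low) are the reserved BMP code points described above.

```python
from typing import Tuple

HIGH_START, LOW_START = 0xD800, 0xDC00


def to_surrogate_pair(cp: int) -> Tuple[int, int]:
    """Split a code point outside the BMP into (high, low) 16-bit values."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a pair")
    offset = cp - 0x10000                 # leaves a 20-bit value
    high = HIGH_START + (offset >> 10)    # top 10 bits
    low = LOW_START + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low pair")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1F600)
    print(f"{high:04X} {low:04X}")  # D83D DE00
```

Each half carries 10 bits, so a pair covers 2^20 code points, which is exactly the range from U+10000 to U+10FFFF.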

Another encoding, UCS-4, uses a single code value between 0 and (theoretically) hexadecimal 7FFFFFFF for each character, although the UCS stops at 10FFFF and ISO/IEC 10646 has stated that all future character assignments will also fall within that range. UCS-4 stores each value in exactly four bytes (one 32-bit word) and can therefore represent every code point in the UCS, including those outside the BMP. As in UCS-2, every encoded character has the same length in bytes, which makes text simple to manipulate, but UCS-4 requires twice as much storage as UCS-2.
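A minimal Python sketch shows why a fixed four-byte width makes manipulation simple: the n-th character always starts at byte offset 4n. The names `encode_ucs4_be` and `char_at` are hypothetical, and big-endian byte order is assumed for the illustration.

```python
import struct


def encode_ucs4_be(text: str) -> bytes:
    """Encode each character as one big-endian 32-bit word (four bytes)."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


def char_at(data: bytes, index: int) -> str:
    """Fetch the character at a given position without scanning the data."""
    (cp,) = struct.unpack_from(">I", data, 4 * index)
    return chr(cp)


if __name__ == "__main__":
    data = encode_ucs4_be("A\U0001F600")  # one BMP and one non-BMP character
    print(len(data))                      # 8: four bytes per character
    print(f"U+{ord(char_at(data, 1)):04X}")
```

The trade-off is visible in the length: a BMP character such as "A" takes four bytes here, against two in UCS-2.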

Currently, the dominant UCS encoding is UTF-8, a variable-width encoding designed for backward compatibility with ASCII and for avoiding the endianness and byte-order-mark complications of UTF-16 and UTF-32. More than half of all Web pages are encoded in UTF-8. The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8. The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and create mail using UTF-8. UTF-8 is also increasingly used as the default character encoding in operating systems, programming languages, APIs, and software applications.
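The variable width and the ASCII compatibility of UTF-8 can be observed with Python's built-in codecs. The helper `utf8_length` and the sample characters are chosen only for this illustration.

```python
# Sample characters: 'A', e-acute, euro sign, and one character outside the BMP.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    for ch in SAMPLES:
        print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")

    # Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
    print("hello".encode("utf-8") == "hello".encode("ascii"))

    # UTF-16 output depends on byte order; UTF-8 has no such variants.
    print("A".encode("utf-16-be").hex(), "A".encode("utf-16-le").hex())
```

The four samples occupy one, two, three, and four bytes respectively, and the same byte sequence is produced on every platform regardless of its endianness.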

See also Comparison of Unicode encodings.
