Variable-width Encoding - General Structure

General Structure

Since the aim of a multibyte encoding system is to minimise changes to existing application software, some characters must retain their pre-existing single-unit codes, even while other characters have multiple units in their codes. The result is that there are three sorts of units in a variable-width encoding: singletons, which consist of a single unit, lead units, which come first in a multiunit sequence, and trail units, which come afterwards in a multiunit sequence. Input and display software obviously needs to know about the structure of the multibyte encoding scheme but other software generally doesn't need to know if a pair of bytes represent two separate characters or just one character.

For example, the four character string "I♥NY" is encoded in UTF-8 like this (shown as hexadecimal byte values): 49 E2 99 A5 4E 59. Of the six units in that sequence, 49, 4E, and 59 are singletons (for I, N, and Y), E2 is a lead unit and 99 and A5 are trail units. The heart symbol is represented by the combination of the lead unit and the two trail units.

UTF-8 makes it easy for a program to identify the three sorts of units as they are kept apart. Older variable-width encodings are typically not so well designed, as in them the trail and lead units may use the same values, and in some all three sorts use overlapping values. Where there is overlap, a text processing application that deals with the variable-width encoding must scan the text from the beginning of all definitive sequences in order to identify the various units properly and interpret the text correctly. In such encodings, one is liable to encounter false positives when searching for a string in the middle of the text. For example, if the hexadecimal values DE and DF and E0 and E1 can all be either lead units or trail units, then a search for the two-unit sequence DF E0 can yield a false positive in the two consecutive two-unit sequences DE DF E0 E1. There is also the danger that a single corrupted or lost unit may render the whole interpretation of a large run of multiunit sequences totally different. In a variable-width encoding where all three sorts of units are disjunct, string searching always works without false positives, and (provided the decoder is well written) the corruption or loss of one unit corrupts only one character.

Read more about this topic:  Variable-width Encoding

Famous quotes containing the words general and/or structure:

    Why not draft executive and management brains to prepare and produce the equipment the $21-a-month draftee must use and forget this dollar-a-year tommyrot? Would we send an army into the field under a dollar-a-year General who had to be home Mondays, Wednesdays and Fridays?
    Lyndon Baines Johnson (1908–1973)

    What is the most rigorous law of our being? Growth. No smallest atom of our moral, mental, or physical structure can stand still a year. It grows—it must grow; nothing can prevent it.
    Mark Twain [Samuel Langhorne Clemens] (1835–1910)