Rich Text Format - Character Encoding

Character Encoding

A standard RTF file can consist of only 7-bit ASCII characters, but can encode characters beyond ASCII by escape sequences. The character escapes are of two types: code page escapes and, starting with RTF 1.5, Unicode escapes. In a code page escape, two hexadecimal digits following a backslash and typewriter apostrophe are used for denoting a character taken from a Windows code page. For example, if the code page is set to Windows-1256, the sequence \'c8 will encode the Arabic letter bāʼ (ب).

For a Unicode escape the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode UTF-16 code unit number. For the benefit of programs without Unicode support, this must be followed by the nearest representation of this character in the specified code page. For example, \u1576? would give the Arabic letter bāʼ ب, specifying that older programs which do not have Unicode support should render it as a question mark instead.

The control word \uc0 can be used to indicate that subsequent Unicode escape sequences within the current group do not specify a substitution character.

Until RTF specification version 1.5 release in 1997, RTF has only handled 7-bit characters directly and 8-bit characters encoded as hexadecimal (using \'xx). RTF control words (since RTF 1.5) generally accept signed 16-bit numbers as arguments. Unicode values greater than 32767 must be expressed as negative numbers. If a Unicode character is outside BMP, it is encoded with a surrogate pair. Support for Unicode was made due to text handling changes in Microsoft Word – Microsoft Word 97 is a partially Unicode-enabled application and it handles text using the 16-bit Unicode character encoding scheme. Microsoft Word 2000 and later versions are Unicode-enabled applications that handle text using the 16-bit Unicode character encoding scheme.

RTF files are usually 7-bit ASCII plain text. RTF consists of control words, control symbols, and groups. RTF files can be easily transmitted between PC based operating systems because they are encoded as a text file with 7-bit graphic ASCII characters. Converters that communicate with Microsoft Word for MS Windows or Macintosh should expect data transfer as 8-bit characters and binary data can contain any 8-bit values.

Read more about this topic:  Rich Text Format

Famous quotes containing the word character:

    It is true enough, Cambridge college is really beginning to wake up and redeem its character and overtake the age.
    Henry David Thoreau (1817–1862)