Arithmetic Coding As A Generalized Change of Radix
Recall that in the case where the symbols had equal probabilities, arithmetic coding could be implemented by a simple change of base, or radix. In general, arithmetic (and range) coding may be interpreted as a generalized change of radix. For example, we may look at any sequence of symbols:
as a number in a certain base presuming that the involved symbols form an ordered set and each symbol in the ordered set denotes a sequential integer A = 0, B = 1, C = 2, D = 3, and so on. This results in the following frequencies and cumulative frequencies:
Symbol | Frequency of occurrence | Cumulative frequency |
---|---|---|
A | 1 | 0 |
B | 2 | 1 |
D | 3 | 3 |
The cumulative frequency is the total of all frequencies below it in a frequency distribution (a running total of frequencies).
In a positional numeral system the radix, or base, is numerically equal to a number of different symbols used to express the number. For example, in the decimal system the number of symbols is 10, namely 0,1,2,3,4,5,6,7,8,9. The radix is used to express any finite integer in a presumed multiplier in polynomial form. For example, the number 457 is actually 4×102 + 5×101 + 7×100, where base 10 is presumed but not shown explicitly.
Initially, we will convert DABDDB into a base-6 numeral, because 6 is the length of the string. The string is first mapped into the digit string 301331, which then maps to an integer by the polynomial:
The result 23671 has a length of 15 bits, which is not very close to the theoretical limit (the entropy of the message), which is approximately 9 bits.
To encode a message with a length closer to the theoretical limit imposed by information theory we need to slightly generalize the classic formula for changing the radix. We will compute lower and upper bounds L and U and choose a number between them. For the computation of L we multiply each term in the above expression by the product of the frequencies of all previously occurred symbols:
The difference between this polynomial and the polynomial above is that each term is multiplied by the product of the frequencies of all previously occurring symbols. More generally, L may be computed as:
where are the cumulative frequencies and are the frequencies of occurrences. Indexes denote the position of the symbol in a message. In the special case where all frequencies are 1, this is the change-of-base formula.
The upper bound U will be L plus the product of all frequencies; in this case U = L + (3 × 1 × 2 × 3 × 3 × 2) = 25002 + 108 = 25110. In general, U is given by:
Now we can choose any number from the interval [L, U) to represent the message; one convenient choice is the value with the longest possible trail of zeroes, 25100, since it allows us to achieve compression by representing the result as 251×102. The zeroes can also be truncated, giving 251, if the length of the message is stored separately. Longer messages will tend to have longer trails of zeroes.
To decode the integer 25100, the polynomial computation can be reversed as shown in the table below. At each stage the current symbol is identified, then the corresponding term is subtracted from the result.
Remainder | Identification | Identified symbol | Corrected remainder |
---|---|---|---|
25100 | 25100 / 65 = 3 | D | (25100 − 65 × 3) / 3 = 590 |
590 | 590 / 64 = 0 | A | (590 − 64 × 0) / 1 = 590 |
590 | 590 / 63 = 2 | B | (590 − 63 × 1) / 2 = 187 |
187 | 187 / 62 = 5 | D | (187 − 62 × 3) / 3 = 26 |
26 | 26 / 61 = 4 | D | (26 − 61 × 3) / 3 = 2 |
2 | 2 / 60 = 2 | B |
During decoding we take the floor after dividing by the corresponding power of 6. The result is then matched against the cumulative intervals and the appropriate symbol is selected from look up table. When the symbol is identified the result is corrected. The process is continued for the known length of the message or while the remaining result is positive. The only difference compared to the classical change-of-base is that there may be a range of values associated with each symbol. In this example, A is always 0, B is either 1 or 2, and D is any of 3, 4, 5. This is in exact accordance with our intervals that are determined by the frequencies. When all intervals are equal to 1 we have a special case of the classic base change.
Read more about this topic: Arithmetic Coding
Famous quotes containing the words arithmetic, generalized and/or change:
“Under the dominion of an idea, which possesses the minds of multitudes, as civil freedom, or the religious sentiment, the power of persons are no longer subjects of calculation. A nation of men unanimously bent on freedom, or conquest, can easily confound the arithmetic of statists, and achieve extravagant actions, out of all proportion to their means; as, the Greeks, the Saracens, the Swiss, the Americans, and the French have done.”
—Ralph Waldo Emerson (18031882)
“One is conscious of no brave and noble earnestness in it, of no generalized passion for intellectual and spiritual adventure, of no organized determination to think things out. What is there is a highly self-conscious and insipid correctness, a bloodless respectability submergence of matter in mannerin brief, what is there is the feeble, uninspiring quality of German painting and English music.”
—H.L. (Henry Lewis)
“That those tribes [the Sac and Fox Indians] cannot exist surrounded by our settlements and in continual contact with our citizens is certain. They have neither the intelligence, the industry, the moral habits, nor the desire of improvement which are essential to any favorable change in their condition.”
—Andrew Jackson (17671845)