Index of Coincidence - Application

The index of coincidence is useful both in the analysis of natural-language plaintext and in the analysis of ciphertext (cryptanalysis). Even when only ciphertext is available for testing and plaintext letter identities are disguised, coincidences in ciphertext can be caused by coincidences in the underlying plaintext. This technique is used to cryptanalyze the Vigenère cipher, for example. For a repeating-key polyalphabetic cipher arranged into a matrix, the coincidence rate within each column will usually be highest when the width of the matrix is a multiple of the key length, and this fact can be used to determine the key length, which is the first step in cracking the system.
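The key-length search described above can be sketched as follows. This is a minimal illustration, not a full attack: the helper names (`index_of_coincidence`, `average_column_ic`, `vigenere_encrypt`) are hypothetical, the text is assumed to be uppercase A-Z only, and the toy cipher exists only to produce test data.

```python
from collections import Counter


def index_of_coincidence(text, alphabet_size=26):
    """Normalized IC: alphabet_size * sum n_i(n_i - 1) / (N(N - 1))."""
    n = len(text)
    if n < 2:
        return 0.0
    counts = Counter(text)
    matches = sum(k * (k - 1) for k in counts.values())
    return alphabet_size * matches / (n * (n - 1))


def average_column_ic(ciphertext, width):
    """Write the text in rows of `width` letters and average the column ICs."""
    columns = [ciphertext[i::width] for i in range(width)]
    return sum(index_of_coincidence(col) for col in columns) / width


def vigenere_encrypt(plaintext, key):
    """Toy Vigenere over uppercase A-Z (assumed input), for test data only."""
    out = []
    for i, ch in enumerate(plaintext):
        shift = ord(key[i % len(key)]) - ord("A")
        out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
    return "".join(out)
```

In practice one would compute `average_column_ic(ciphertext, w)` for a range of candidate widths `w` and look for peaks at multiples of a common value; the smallest such width is a candidate key length.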

Coincidence counting can help determine when two texts are written in the same language using the same alphabet. (This technique has been used to examine the purported Bible code.) The causal coincidence count for such texts will be distinctly higher than the accidental coincidence count for texts in different languages, or texts using different alphabets, or gibberish texts.

To see why, imagine an "alphabet" of only the two letters A and B. Suppose that in our "language", the letter A is used 75% of the time, and the letter B is used 25% of the time. If two texts in this language are laid side by side, then the following pairs can be expected:

Pair Probability
AA 56.25%
BB 6.25%
AB 18.75%
BA 18.75%

Overall, the probability of a "coincidence" is 62.5% (56.25% for AA + 6.25% for BB).

Now consider the case when both messages are encrypted using the simple monoalphabetic substitution cipher which replaces A with B and vice versa:

Pair Probability
AA 6.25%
BB 56.25%
AB 18.75%
BA 18.75%

The overall probability of a coincidence in this situation is 62.5% (6.25% for AA + 56.25% for BB), exactly the same as for the unencrypted "plaintext" case. In effect, the new alphabet produced by the substitution is just a uniform renaming of the original character identities, which does not affect whether they match.

Now suppose that only one message (say, the second) is encrypted using the same substitution cipher (A,B)→(B,A). The following pairs can now be expected:

Pair Probability
AA 18.75%
BB 18.75%
AB 56.25%
BA 6.25%

Now the probability of a coincidence is only 37.5% (18.75% for AA + 18.75% for BB). This is noticeably lower than the probability when same-language, same-alphabet texts were used. Evidently, coincidences are more likely when the most frequent letters in each text are the same.
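The three cases above can be checked with a few lines of Python. The function name `coincidence_probability` is a hypothetical helper; it assumes the two letters at each position are drawn independently from the given distributions.

```python
def coincidence_probability(p, q):
    """Probability that independent letters drawn from p and q are equal."""
    return sum(p[letter] * q.get(letter, 0.0) for letter in p)


plain = {"A": 0.75, "B": 0.25}    # the two-letter "language"
swapped = {"A": 0.25, "B": 0.75}  # the same language after the A<->B substitution

print(coincidence_probability(plain, plain))      # 0.625 (both unencrypted)
print(coincidence_probability(swapped, swapped))  # 0.625 (both encrypted)
print(coincidence_probability(plain, swapped))    # 0.375 (only one encrypted)
```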

The same principle applies to real languages like English, because certain letters, like E, occur much more frequently than other letters—a fact which is used in frequency analysis of substitution ciphers. Coincidences involving the letter E, for example, are relatively likely. So when any two English texts are compared, the coincidence count will be higher than when an English text and a foreign-language text are used.
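Counting coincidences between two actual texts, rather than between two distributions, might look like the sketch below. The name `coincidence_rate` is hypothetical, and the choice to compare only up to the length of the shorter text is an assumption of this example.

```python
def coincidence_rate(text_a, text_b):
    """Fraction of aligned positions where both texts have the same letter."""
    length = min(len(text_a), len(text_b))
    if length == 0:
        return 0.0
    # zip stops at the shorter text, matching `length` above
    hits = sum(1 for a, b in zip(text_a, text_b) if a == b)
    return hits / length


print(coincidence_rate("ABBA", "ABAB"))  # 0.5: positions 1 and 2 match, 3 and 4 do not
```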

This effect can be subtle. For example, similar languages will have a higher coincidence count than dissimilar languages. Also, it is not hard to generate random text with a frequency distribution similar to that of real text, artificially raising the coincidence count. Nevertheless, this technique can be used effectively to identify when two texts are likely to contain meaningful information in the same language using the same alphabet, to discover periods for repeating keys, and to uncover many other kinds of nonrandom phenomena within or among ciphertexts.

The same idea can be applied to a single text, where the sample is in effect compared with itself. Mathematically, we can compute the index of coincidence IC for a given letter-frequency distribution as

$$\mathrm{IC} = \frac{\sum_{i=1}^{c} n_i (n_i - 1)}{N(N-1)/c}$$

where $N$ is the length of the text and $n_1$ through $n_c$ are the frequencies (as integers) of the $c$ letters of the alphabet ($c = 26$ for monocase English). The sum of the $n_i$ is necessarily $N$.

The products $n(n-1)$ count the number of combinations of $n$ elements taken two at a time. (Actually this counts each pair twice; the extra factors of 2 occur in both numerator and denominator of the formula and thus cancel out.) Each of the $n_i$ occurrences of the $i$-th letter matches each of the remaining $n_i - 1$ occurrences of the same letter. There are a total of $N(N-1)$ letter pairs in the entire text, and $1/c$ is the probability of a match for each pair, assuming a uniform random distribution of the characters (the "null model"; see below). Thus, this formula gives the ratio of the total number of coincidences observed to the total number of coincidences that one would expect from the null model.
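As a small worked example (the eight-letter sample text is an illustrative assumption, not taken from the source), take the two-letter alphabet used earlier, so that $c = 2$, and the text AAAAAABB, so that $N = 8$, $n_A = 6$ and $n_B = 2$. Taking the index to be the number of ordered matching pairs, $\sum_i n_i(n_i - 1)$, divided by the number expected under the null model, $N(N-1)/c$:

$$\mathrm{IC} = \frac{6 \cdot 5 + 2 \cdot 1}{(8 \cdot 7)/2} = \frac{32}{28} \approx 1.14$$

The result is above 1.0 because the letter A dominates the sample, so matching pairs are more common than they would be if A and B were equally likely.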

The expected average value for the I.C. can be computed from the relative letter frequencies $f_i$ of the source language:

$$\mathrm{IC}_{\text{expected}} = \frac{\sum_{i=1}^{c} f_i^2}{1/c}$$

If all $c$ letters of an alphabet were equally likely, the expected index would be 1.0. The actual monographic I.C. for telegraphic English text is around 1.73, reflecting the unevenness of natural-language letter distributions. Expected values for various languages are:

Language Index of Coincidence
English 1.73
French 2.02
German 2.05
Italian 1.94
Portuguese 1.94
Russian 1.76
Spanish 1.94
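The expected value computed from relative frequencies can be sketched in a few lines. The function name `expected_ic` is hypothetical, and the example uses the two-letter language from earlier rather than real language statistics, which are not reproduced here.

```python
def expected_ic(frequencies):
    """Expected IC = c * sum(f_i ** 2) for relative frequencies summing to 1."""
    c = len(frequencies)
    return c * sum(f * f for f in frequencies)


print(expected_ic([0.75, 0.25]))   # 1.25 for the two-letter language
print(expected_ic([1 / 26] * 26))  # approximately 1.0 for a uniform 26-letter alphabet
```

Passing a 26-entry list of measured letter frequencies for a given language would give a value comparable to those in the table, subject to the quality of the frequency data.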

Sometimes similar values are reported without the normalizing denominator, for example $0.067 = 1.73/26$ for English; such values may be called $\kappa_p$ ("kappa-plaintext") rather than "I.C.", with $\kappa_r$ ("kappa-random") used to denote the denominator $1/c$ (which is the expected coincidence rate for a uniform distribution of the same alphabet, $0.0385 = 1/26$ for English).
