Binary Classification - Evaluation of Binary Classifiers

Evaluation of Binary Classifiers

See also: sensitivity and specificity, precision and recall

To measure the performance of a medical test, the concepts sensitivity and specificity are often used; these concepts are readily usable for the evaluation of any binary classifier. Say we test some people for the presence of a disease. Some of these people have the disease, and our test says they are positive. They are called true positives (TP). Some have the disease, but the test claims they don't. They are called false negatives (FN). Some don't have the disease, and the test says they don't - true negatives (TN). Finally, there might be healthy people who have a positive test result - false positives (FP). Thus, the number of true positives, false negatives, true negatives, and false positives add up to 100% of the set.

Specificity (TNR) is the proportion of people that tested negative (TN) of all the people that actually are negative (TN+FP). As with sensitivity, it can be looked at as the probability that the test result is negative given that the patient is not sick. With higher specificity, fewer healthy people are labeled as sick (or, in the factory case, the less money the factory loses by discarding good products instead of selling them).

Sensitivity (TPR) is the proportion of people that tested positive (TP) of all the people that actually are positive (TP+FN). It can be seen as the probability that the test is positive given that the patient is sick. With higher sensitivity, fewer actual cases of disease go undetected (or, in the case of the factory quality control, the fewer faulty products go to the market).

The relationship between sensitivity and specificity, as well as the performance of the classifier, can be visualized and studied using the ROC curve.

In theory, sensitivity and specificity are independent in the sense that it is possible to achieve 100% in both (such as in the red/blue ball example given above). In more practical, less contrived instances, however, there is usually a trade-off, such that they are inversely proportional to one another to some extent. This is because we rarely measure the actual thing we would like to classify; rather, we generally measure an indicator of the thing we would like to classify, referred to as a surrogate marker. The reason why 100% is achievable in the ball example is because redness and blueness is determined by directly detecting redness and blueness. However, indicators are sometimes compromised, such as when non-indicators mimic indicators or when indicators are time-dependent, only becoming evident after a certain lag time. The following example of a pregnancy test will make use of such an indicator.

Modern pregnancy tests do not use the pregnancy itself to determine pregnancy status; rather, human chorionic gonadotropin is used, or hCG, present in the urine of gravid females, as a surrogate marker to indicate that a woman is pregnant. Because hCG can also be produced by a tumor, the specificity of modern pregnancy tests cannot be 100% (in that false positives are possible). Also, because hCG is present in the urine in such small concentrations after fertilization and early embryogenesis, the sensitivity of modern pregnancy tests cannot be 100% (in that false negatives are possible).

In addition to sensitivity and specificity, the performance of a binary classification test can be measured with positive (PPV) and negative predictive values (NPV). The positive prediction value answers the question "If the test result is positive, how well does that predict an actual presence of disease?". It is calculated as (true positives) / (true positives + false positives); that is, it is the proportion of true positives out of all positive results. (The negative prediction value is the same, but for negatives, naturally.)

There is one crucial difference between the two concepts: Sensitivity and specificity are independent from the population in the sense that they do not change depending on the tested proportion of positives and negatives. Indeed, the sensitivity of the test can be determined by testing only positive cases. However, the prediction values are dependent on the population.

Read more about this topic:  Binary Classification

Famous quotes containing the words evaluation of and/or evaluation:

    Good critical writing is measured by the perception and evaluation of the subject; bad critical writing by the necessity of maintaining the professional standing of the critic.
    Raymond Chandler (1888–1959)

    Evaluation is creation: hear it, you creators! Evaluating is itself the most valuable treasure of all that we value. It is only through evaluation that value exists: and without evaluation the nut of existence would be hollow. Hear it, you creators!
    Friedrich Nietzsche (1844–1900)