How to measure and compare classifiers

How do we decide which classifier is best? We rarely find a perfect model, one that never makes a mistake, so we need a way to choose among imperfect ones. We used accuracy before, but sometimes it is better to optimize so that the model makes fewer errors of a specific kind. For example, in spam filtering, deleting a good email may be worse than erroneously letting a bad email through. In that case, we may prefer a model that is conservative about throwing out emails over one that simply makes the fewest mistakes overall. We can discuss these issues in terms of gain (which we want to maximize) or loss (which we want to minimize). The two are equivalent, since a loss can be treated as a negative gain, but sometimes one is more convenient than the other, and you will read articles that discuss either minimizing the loss or maximizing the gain.

In a medical setting, false negatives and false positives are not equivalent. A false negative (the test comes back negative even though the patient has the disease) might lead to the patient not receiving treatment for a serious disease. A false positive (the test comes back positive even though the patient does not actually have the disease) might lead to additional tests to confirm the result or to unnecessary treatment (which still has costs, including side effects, but these are often less serious than a missed diagnosis). Therefore, depending on the exact setting, different trade-offs can make sense. At one extreme, if the disease is fatal and the treatment is cheap with very few negative side effects, then you want to minimize false negatives as much as you can.
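To make the two kinds of error concrete, the following minimal sketch counts them separately for a binary test. The function name `confusion_counts` and the eight patient results are hypothetical, made up purely for illustration; labels are assumed to be 1 for "has the disease" and 0 otherwise.

```python
def confusion_counts(y_true, y_pred):
    """Count the four possible outcomes of a binary classifier.

    Both sequences are assumed to hold 0/1 (or boolean) labels,
    where 1 means "positive" (here: the patient has the disease).
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred and truth:
            tp += 1          # correctly flagged as positive
        elif pred and not truth:
            fp += 1          # false positive: flagged, but healthy
        elif truth and not pred:
            fn += 1          # false negative: disease was missed
        else:
            tn += 1          # correctly left alone
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


# Hypothetical results for eight patients (illustrative data only)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)  # {'tp': 2, 'fp': 1, 'tn': 4, 'fn': 1}
```

Accuracy alone would report 6 correct out of 8 here and hide the fact that one of the two errors is a missed disease, which is exactly the distinction that matters in this setting.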

What the gain/cost function should be always depends on the exact problem you are working on. When we present a general-purpose algorithm, we often focus on minimizing the number of mistakes, that is, on achieving the highest accuracy. However, if some mistakes are costlier than others, it might be better to accept a lower overall accuracy in order to minimize the overall cost.
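As a rough illustration of this trade-off, the sketch below compares two hypothetical classifiers on the same ten examples. The helper names (`accuracy`, `total_cost`), the predictions, and the cost values (a false negative assumed to be ten times as costly as a false positive) are all assumptions chosen for the example, not values taken from any real problem.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def total_cost(y_true, y_pred, fp_cost, fn_cost):
    """Sum the cost of all mistakes; correct predictions cost nothing."""
    cost = 0
    for truth, pred in zip(y_true, y_pred):
        if pred and not truth:
            cost += fp_cost   # false positive
        elif truth and not pred:
            cost += fn_cost   # false negative
    return cost


# Illustrative data: two positives among ten examples
y_true  = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
model_a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # misses one positive
model_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # raises two false alarms

# Assumed costs: missing a positive is 10x worse than a false alarm
FP_COST, FN_COST = 1, 10

acc_a = accuracy(y_true, model_a)
acc_b = accuracy(y_true, model_b)
cost_a = total_cost(y_true, model_a, FP_COST, FN_COST)
cost_b = total_cost(y_true, model_b, FP_COST, FN_COST)

print(acc_a, cost_a)  # 0.9 10
print(acc_b, cost_b)  # 0.8 2
```

Under these assumed costs, the second model is preferable even though it is less accurate: it makes more mistakes, but they are of the cheaper kind. With different costs the ranking could flip, which is why the cost function has to come from the problem itself.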