Be wary of evaluations of AI classifiers that focus only on predictive accuracy as a single percentage. Instead, as a bare minimum, we need separate quantification of how well the AI performs when it says yes and how well it performs when it says no.

Predictive accuracy is strongly determined by how common the true yes and no answers are. If a system is designed to diagnose a rare disease, the easiest way to increase accuracy is to always predict no, since that prediction is correct for the vast majority of cases.
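A minimal sketch of this base-rate trap, assuming a hypothetical screening population of 10,000 patients with 1% prevalence:

```python
# Hypothetical screening data: 1% of 10,000 patients truly have the disease.
n_patients = 10_000
prevalence = 0.01
n_positive = int(n_patients * prevalence)
true_labels = [1] * n_positive + [0] * (n_patients - n_positive)

# A "classifier" that always predicts no (0), regardless of the patient.
predictions = [0] * n_patients

correct = sum(p == t for p, t in zip(predictions, true_labels))
accuracy = correct / n_patients
print(f"Accuracy of always-no: {accuracy:.0%}")  # 99%, yet it never detects a single case
```

The always-no classifier scores 99% accurate while catching zero patients who actually have the disease, which is exactly why a single percentage is misleading.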

Instead, we need to know how frequently the AI is missing diseases that have a treatment, and how often it is over-diagnosing and leading to potentially harmful overtreatment.
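These two failure modes can be read directly off a confusion matrix. A sketch with hypothetical counts (the numbers below are invented for illustration):

```python
# Hypothetical confusion-matrix counts for a diagnostic classifier.
tp, fn = 80, 20      # truly diseased patients: caught vs missed
tn, fp = 9700, 200   # truly healthy patients: correctly cleared vs over-diagnosed

sensitivity = tp / (tp + fn)  # of patients with the disease, fraction we catch
specificity = tn / (tn + fp)  # of healthy patients, fraction we correctly clear

miss_rate = fn / (tp + fn)          # missed treatable disease (under-diagnosis)
false_alarm_rate = fp / (tn + fp)   # over-diagnosis risking harmful overtreatment

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.3f}")
print(f"miss rate={miss_rate:.2f}, false alarm rate={false_alarm_rate:.3f}")
```

Reporting sensitivity and specificity (or, equivalently, the miss rate and false-alarm rate) separates the yes-performance from the no-performance, which a single accuracy figure conflates.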

Another application of AI that we have all been using for decades is spam detection. In this domain, we need to know how often a spam filter is exposing us to cryptocurrency grifters (for example) and how often it is quietly deleting important emails.

AI systems allow a threshold to be tweaked that pushes them towards under- or over-diagnosis. How you choose the threshold depends on what action will be taken as a result of the classification: will users merely see a little extra spam that they can swiftly delete, or will patients be subjected to unnecessary chemotherapy?
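The trade-off can be made concrete by sweeping the threshold over a classifier's output scores. A sketch, assuming hypothetical probability-of-disease scores for a handful of cases:

```python
# Hypothetical classifier scores (probability of disease) for a few cases.
positives = [0.9, 0.7, 0.55, 0.4]   # truly diseased
negatives = [0.6, 0.3, 0.2, 0.1]    # truly healthy

def trade_off(threshold):
    """Count both error types at a given decision threshold."""
    missed = sum(s < threshold for s in positives)    # under-diagnosis
    over = sum(s >= threshold for s in negatives)     # over-diagnosis
    return missed, over

for t in (0.3, 0.5, 0.7):
    missed, over = trade_off(t)
    print(f"threshold={t}: missed={missed}, over-diagnosed={over}")
```

Lowering the threshold here eliminates missed cases at the cost of more false alarms, and raising it does the reverse; neither setting is "correct" until you know the cost of each error.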