Accuracy fools you!

Majid
6 min read · Oct 19, 2020
LeBron vs. Jimmy Butler. Source

It started when I switched tabs to check how the Miami Heat were doing against the Lakers. A few turnovers by the Lakers later, I switched back to my training notebook and noticed something weird!

my validation accuracy vs epoch

This looks alright. I mean as a passionate newbie I probably wouldn’t want this little plot to turn out any other way. However, you can imagine my consternation when I saw the loss plots:

validation and training losses vs epoch

Huh! Apparently, my little ResNet starts overfitting after epoch four, but then how come my accuracy keeps rising? Should I pick the model with the highest accuracy and go on with that?

The short answer is NO.

To break this down, let’s consider a binary classification task: is an image a picture of a cat or not? To make a correct prediction and increase the accuracy, the model only needs to assign a probability above 0.5 to a cat image. Even if the probability comes out as 0.5000001, the image will be assigned the label “1”, i.e. “cat”. Put differently, once the model meets the 0.5 threshold, it can lazily label the image as “cat” without further investigation. What happens to the loss in this case?

An output score of 0.5000001 increases the accuracy, but it will barely decrease the loss, because the gap to the true label (1 - 0.5000001 = 0.4999999) is still large. With binary cross-entropy that prediction costs -log(0.5000001) ≈ 0.69, almost as much as a pure coin flip :) .
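To see the mismatch in numbers, here is a tiny NumPy sketch of that single borderline prediction (the 0.5000001 score just mirrors the example above, with binary cross-entropy assumed as the loss):

```python
import numpy as np

# One true "cat" example (label 1) scored just barely above the threshold.
y_true = 1
p = 0.5000001

# Accuracy only looks at the thresholded label: 0.5000001 -> "cat" -> correct.
pred_label = int(p > 0.5)
print(pred_label == y_true)  # True, so accuracy goes up

# Binary cross-entropy still sees how unsure the model was.
bce = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(round(bce, 4))  # ~0.6931, almost as bad as random guessing
```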

The other issue: say the true label is “cat” but the output score is 0.1. In this case, the prediction is evidently “0”, or “non-cat”. The number of wrong predictions increases by 1, i.e. the accuracy drops a little (how much depends on the total number of predictions). The loss, however, increases enormously, since -log(0.1) ≈ 2.3. This discrepancy in how much the two are affected can cloud our judgment when picking the best model.
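To make that discrepancy concrete, here is a small hypothetical batch (made-up scores, not the post’s actual data), again assuming binary cross-entropy:

```python
import numpy as np

def accuracy_and_bce(y_true, p):
    """Accuracy at a 0.5 threshold and the mean binary cross-entropy."""
    y_true, p = np.asarray(y_true, dtype=float), np.asarray(p, dtype=float)
    acc = np.mean((p > 0.5) == y_true)
    bce = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return acc, bce

# Ten true "cat" images, all predicted with mild confidence.
y = np.ones(10)
scores = np.full(10, 0.6)
print(accuracy_and_bce(y, scores))   # (1.0, ~0.51)

# Same batch, but one image now gets a confident wrong score of 0.1.
scores[0] = 0.1
print(accuracy_and_bce(y, scores))   # (0.9, ~0.69): accuracy dips 10%, loss jumps ~35%
```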

Now two questions:

  1. Are we working with a balanced data set?
  2. How much does it matter if we misclassify an input?

The first question matters because imagine our dataset consists of 950 cat images and 50 non-cat images. Your classifier would be like: “Gee, I am not gonna bother. I will predict “cat” all the time and end up with 95% accuracy. I won’t even bother to decrease the loss, because once I meet the 0.5 criterion I am good to output my prediction.” Which is not cool. So accuracy is not a reliable measure here.
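Here is that lazy classifier in a few lines of NumPy (the 950/50 split is just the made-up example from this paragraph):

```python
import numpy as np

# Hypothetical imbalanced set: 950 cat images (label 1) and 50 non-cat images (label 0).
y_true = np.array([1] * 950 + [0] * 50)

# The "lazy" classifier that always says "cat".
y_pred = np.ones_like(y_true)

print(np.mean(y_pred == y_true))          # 0.95 -- looks great on paper
print(np.mean(y_pred[y_true == 0] == 0))  # 0.0  -- it never catches a single non-cat
```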

To take the second question into account, forget about cat predictions and suppose we are training a robot to make profitable purchases in the stock market based on some input data (whatever it is). We don’t want a model that makes decisions with low confidence, right? Say that for some input the threshold is met and the score comes out as 0.51. In this case, the model is saying, “I am not really sure, but I guess we‘ll make millions with this purchase. Not realllly sure though!? Oh, what the hell, let’s buy!” and boom, you lose millions to your surprise. If the robot had instead let that purchase go, it would have been hard to give up the greed and the dream of becoming a millionaire, but at least you would have been safe, having lost nothing. In this case, you don’t want to make a purchase unless the robot is very sure, probably with output scores higher than 0.90. Once again accuracy falls short; this time it fails to distinguish between a confident and an unconfident robot.
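If you wanted that cautious behaviour, one simple option is to raise the decision threshold instead of using the default 0.5. A toy sketch with made-up scores:

```python
import numpy as np

# Hypothetical scores from the trading robot for ten candidate purchases.
scores = np.array([0.51, 0.55, 0.62, 0.71, 0.85, 0.91, 0.93, 0.96, 0.49, 0.30])

# Trading at the default 0.5 threshold acts on near coin flips like 0.51.
print(np.sum(scores > 0.5))    # 8 trades, several of them barely above chance

# A conservative robot only acts when it is very sure.
print(np.sum(scores > 0.90))   # 3 trades, all high-confidence
```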

So next time around, I am simply going to decide based on the loss values. As a matter of fact, if I had plotted the training accuracy, I might have seen a huge gap between training accuracy and validation accuracy. What follows is a little opener on what can be used alongside accuracy to help with the interpretation of results.

Confidence, Specificity, and Sensitivity:

So going forward, I have decided to take a few other things into account besides the accuracy of my model:

Confidence interval

To make this short, think of it as a measure of how much you can trust your classifier’s reported accuracy. The more data from the target distribution the classifier is evaluated on, the more trustworthy that number becomes: 99% accuracy from a well-tested model means something very different from 99% measured on a handful of examples. Suppose the accuracy of your model actually comes out as 99%. Given that the model is only evaluated on a limited number of examples, you can’t expect it to always be exactly 99% accurate; if you have a good model, it will be very close to 99% with high probability. How close, and with how high a probability, is a matter of investigation. The two are related through this formula (the normal approximation to the binomial proportion interval):

interval = z * sqrt(accuracy * (1 - accuracy) / n)

Here n is the number of observed examples. Our two variables are z, which is tied to the probability (the confidence level) of our model actually being that accurate, and interval, which is the tolerance around the measured accuracy.

It is intuitive that as the probability (and with it z) increases, the interval widens too: a wider interval means a higher chance that the true accuracy falls inside it. The values of z come from statistics; each one is the number of standard deviations of a Gaussian needed to cover the corresponding probability mass. Common values of z and their probabilities are listed below, with a short code sketch after the list:

  • 1.64 : 90%
  • 1.96 : 95%
  • 2.33 : 98%
  • 2.58 : 99%
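Putting the formula and the table together, a minimal sketch (the 1,000-example validation set is just an assumed number for illustration):

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Half-width of the approximate confidence interval for a measured accuracy,
    using interval = z * sqrt(accuracy * (1 - accuracy) / n)."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)

# Example: 99% measured accuracy on 1,000 validation images, 95% confidence (z = 1.96).
radius = accuracy_interval(0.99, 1000, z=1.96)
print(f"true accuracy likely in [{0.99 - radius:.3f}, {0.99 + radius:.3f}]")
# -> roughly [0.984, 0.996]
```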

Sensitivity

Sensitivity with respect to a class ‘A’ is literally how sensitive our model is to instances with label ‘A’. In other words, upon visiting an instance from class ‘A’, what is the chance of our model actually classifying it as ‘A’:

sensitivity = TP / (TP + FN)

where TP (true positives) are the ‘A’ instances correctly labeled ‘A’ and FN (false negatives) are the ‘A’ instances the model missed.

Specificity

We can define specificity for each class in our prediction. Specificity for class ‘A’ is a measure of the model’s ability to correctly detect ‘non-A’ instances. In other words, of all the true ‘non-A’ instances, what proportion does our model correctly classify as ‘non-A’:

specificity = TN / (TN + FP)

where TN (true negatives) are the ‘non-A’ instances correctly labeled ‘non-A’ and FP (false positives) are the ‘non-A’ instances wrongly labeled ‘A’.
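Both quantities fall straight out of the confusion counts; here is a small helper (my own sketch, not from any particular library) applied to the lazy “always cat” classifier from earlier:

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred, positive=1):
    """Sensitivity = TP / (TP + FN), specificity = TN / (TN + FP) for one class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    tn = np.sum((y_true != positive) & (y_pred != positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    return tp / (tp + fn), tn / (tn + fp)

# The lazy classifier: perfectly sensitive to cats, zero specificity.
y_true = [1] * 950 + [0] * 50
y_pred = [1] * 1000
print(sensitivity_specificity(y_true, y_pred))   # (1.0, 0.0)
```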

You can think of scenarios where you would care more about one than the other. In the stock market, a risk-tolerant trader would not mind losing a few hundred if the flip side is winning thousands; he or she would be more interested in a sensitive model that can identify the positive windows in which a trade could mean profit. On the other end, a conservative trader whose priority is not losing a penny needs a model that is especially good at flagging the windows in which a trade could result in a loss, even a low-probability one.

All in all, I think I personally need to be more careful and really investigate my model’s final results. Since I will probably be studying these matters a bit more, let me know if I should expand on a topic I have already mentioned or look into other ways of evaluating models. Thank you ;)

P.S…who won the series? ;p
