I find it interesting that some people rage against code coverage. A prevalent opinion is, “If I write a test case with no assertions, I achieve 100% coverage, but I am not testing anything!” This is true: if your tests have no assertions, they do not test anything, even though the production code is exercised. However, I consider that a flawed argument. It assumes the worst possible (and unrealistic) scenario. If you are writing test suites with no assertions, you have bigger problems to take care of before you can enjoy the benefits of structural testing.
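To see why the argument is both true and unrealistic, consider the following sketch (the countVowels method and both tests are hypothetical, written here only for illustration). A test without assertions can indeed reach 100% line coverage while being unable to ever fail:

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

class CoverageWithoutAssertionsTest {

    // Hypothetical production method, inlined here for brevity.
    static int countVowels(String text) {
        int count = 0;
        for (char c : text.toLowerCase().toCharArray()) {
            if ("aeiou".indexOf(c) >= 0) {
                count++;
            }
        }
        return count;
    }

    @Test
    void exercisesEveryLineButChecksNothing() {
        // Coverage tools report 100% line coverage of countVowels,
        // yet this test can never fail: there is no assertion.
        countVowels("dogs and cats");
    }

    @Test
    void exercisesTheSameLinesAndChecksTheResult() {
        // Same coverage, but a broken implementation is now detected.
        assertEquals(3, countVowels("dogs and cats"));
    }
}
```

Both tests produce the same coverage number; only the second one verifies behavior. The problem lies in the assertion-free test, not in the coverage metric.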
Reading between the lines, people use such an argument to say that you should not look at the coverage number blindly, because it can mislead you. With that, I fully agree. The misconception lies in how people see code coverage: if code coverage is only a number to be achieved, you may end up writing less useful test cases and gaming the metric (something that Bouwers, Visser, and Van Deursen argued in 2012).
I hope I have clarified how structural testing and code coverage should be used: to augment specification-based testing, to quickly identify parts of the code that are not yet exercised by the test suite, and to identify partitions you missed when doing specification-based testing. Achieving a high coverage number may be a consequence of doing that, but it is not the goal. If you leave a line uncovered, it should be because you thought about it and decided not to cover it.
Empirical evidence in favor of code coverage
Understanding whether structural coverage helps and whether high coverage numbers lead to better-tested software has been the goal of many empirical software engineering researchers. Interestingly, while researchers have not yet found a magical coverage number that we should aim for, some evidence points toward the benefits of structural testing. I quote four of these studies:
- Hutchins et al. (1994) —“Within the limited domain of our experiments, test sets achieving coverage levels over 90% usually showed significantly better fault detection than randomly chosen test sets of the same size. In addition, significant improvements in the effectiveness of coverage-based tests usually occurred as coverage increased from 90% to 100%. However, the results also indicate that 100% code coverage alone is not a reliable indicator of the effectiveness of a test set.”
- Namin and Andrews (2009) —“Our experiments indicate that coverage is sometimes correlated with effectiveness when test suite size is controlled for, and that using both size and coverage yields a more accurate prediction of effectiveness than test suite size alone. This, in turn, suggests that both size and coverage are important to test suite effectiveness.”
- Inozemtseva and Holmes (2014) —“We found that there is a low to moderate correlation between coverage and effectiveness when the number of test cases in the suite is controlled for. In addition, we found that stronger forms of coverage do not provide greater insight into the effectiveness of the suite. Our results suggest that coverage, while useful for identifying under-tested parts of a program, should not be used as a quality target because it is not a good indicator of test suite effectiveness.”
- Gopinath et al. (2020) —“This paper finds a correlation between lightweight, widely available coverage criteria (statement, block, branch, and path coverage) and mutation kills for hundreds of Java programs (…). For both original and generated suites, statement coverage is the best predictor for mutation kills, and in fact does a relatively good job of predicting suite quality.”
Although developing sound experiments to show whether coverage helps is difficult, and we are not quite there yet (see Chen et al.’s 2020 paper for a good statistical explanation of why it is hard), the current results make sense to me. Even with the small code examples we have been exploring, we can see a relationship between covering all the partitions via specification-based testing and covering the entire source code. The opposite is also true: if you cover a significant part of the source code, you also cover most of the partitions. Therefore, high coverage implies more partitions being tested.
The empirical results also show that coverage alone is not always a strong indicator of how good a test suite is. We saw that in the test cases we derived for the CountWords problem at the beginning: we deliberately did poor specification-based testing and then augmented the test suite with structural testing. We ended up with three test cases that achieve 100% condition + branch coverage. But is the test suite strong enough? I don’t think so. I can think of many extra test cases that would touch the same lines and branches again but would nonetheless make the test suite much more effective against possible bugs.
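To make that concrete, here is a minimal sketch of such extra test cases. The countWords implementation below is only an assumption for illustration (it counts the words that end in “s” or “r”), and the tests are hypothetical: the point is that these inputs re-exercise lines and branches a suite with 100% coverage already touches, yet they probe partitions the original three tests could easily have missed.

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

class CountWordsExtraTest {

    // Assumed implementation, used here only to make the tests runnable:
    // counts the words in a sentence that end with 's' or 'r'.
    static int countWords(String sentence) {
        int count = 0;
        for (String word : sentence.split("\\s+")) {
            if (word.endsWith("s") || word.endsWith("r")) {
                count++;
            }
        }
        return count;
    }

    @Test
    void singleLetterWordStillCounts() {
        // A word that is nothing but the letter itself.
        assertEquals(1, countWords("s"));
    }

    @Test
    void letterInTheMiddleDoesNotCount() {
        // 's' appears inside "system" but not at the end of any word.
        assertEquals(0, countWords("system boot"));
    }

    @Test
    void mixOfMatchingAndNonMatchingWords() {
        // "cats" and "her" match; "chase" and "quickly" do not.
        assertEquals(2, countWords("cats chase her quickly"));
    }
}
```

None of these tests raises the coverage number, but each one makes the suite more likely to catch a boundary or off-by-one bug.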
On the other hand, although 100% coverage does not necessarily mean the system is properly tested, very low coverage does mean the system is not properly tested. A system with, say, 10% coverage leaves the vast majority of its code completely unexercised, so there is clearly much testing work left to do.
I suggest reading Google’s code coverage best practices (Arguelles, Ivankovic, and Bender, 2020). Their recommendations are in line with everything we have discussed here.