Tests should be repeatable and not flaky

A repeatable test gives the same result no matter how many times it is executed. Developers lose trust in flaky tests, that is, tests that sometimes pass and sometimes fail without any change to the system or the test code.

Flaky tests hurt the productivity of software development teams. It is hard to know whether a flaky test is failing because the behavior is buggy or because it is flaky. Little by little, flaky tests can make us lose confidence in our test suites. Such lack of confidence may lead us to deploy our systems even though the tests fail (they may be broken because of flakiness, not because the system is misbehaving).

The prevalence and impact of flaky tests in the software development world have increased over time (or, at least, we talk more about them now). Companies like Google and Facebook have publicly talked about problems caused by flaky tests.

A test can become flaky for many reasons:

Because it depends on external or shared resources: If a test depends on a database, many things can cause flakiness. For example, the database may not be available at the moment the test is executed, it may contain data that the test does not expect, or two developers may be running the test suite at the same time and sharing the same database, causing one to break the test of the other.
Due to improper time-outs: This is a common reason in web testing. Suppose a test has to wait for something to happen in the system: for example, a request coming back from a web service, which is then displayed in an HTML element. If the web application is slower than normal, the test may fail because it did not wait long enough.
Because of a hidden interaction between different test methods: Test A somehow influences the result of test B, possibly causing it to fail.
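To make the time-out case concrete, here is a minimal sketch of a polling wait, which is one common alternative to a fixed sleep. The helper `wait_until` and the `SlowService` class are hypothetical names invented for this illustration; they do not come from any particular testing library.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


class SlowService:
    """Hypothetical stand-in for a web service that answers after `delay` seconds."""

    def __init__(self, delay):
        self._ready_at = time.monotonic() + delay

    def response(self):
        # Returns None until the simulated delay has elapsed.
        return "done" if time.monotonic() >= self._ready_at else None


def test_response_is_eventually_available():
    service = SlowService(delay=0.2)
    # The test waits only as long as needed, up to a generous upper bound,
    # instead of sleeping for a fixed time that may be too short on a slow day.
    assert wait_until(service.response, timeout=2.0) == "done"
```

The time-out still exists, but it now acts as an upper bound rather than as a guess about how long the system takes, so a slower-than-normal run is less likely to fail the test.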

The work of Luo et al. (2014) also sheds light on the causes of flaky tests. After analyzing 201 commits that fix flaky tests in open source projects, the authors noticed the following:

Async wait, concurrency, and test order dependency are the three most common causes of flakiness.
Most flaky tests are flaky from the time they are written.
Flaky tests are rarely caused by platform specifics (for example, they seldom fail because of differences between operating systems).
Flakiness is often due to dependencies on external resources and can be fixed by cleaning the shared state between test runs.
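The last finding, cleaning shared state between test runs, can be sketched as follows. This is only an illustration under simple assumptions: `SHARED_CART` and `add_to_cart` are hypothetical names standing in for any resource shared by several tests, such as a database table.

```python
import unittest

# Hypothetical shared state that several tests read and write.
SHARED_CART = []


def add_to_cart(item):
    SHARED_CART.append(item)


class CartTest(unittest.TestCase):
    def setUp(self):
        # Reset the shared state before every test, so that no test
        # sees leftovers from a test that happened to run earlier.
        SHARED_CART.clear()

    def test_add_one_item(self):
        add_to_cart("book")
        self.assertEqual(SHARED_CART, ["book"])

    def test_cart_starts_empty(self):
        # Without the reset in setUp, this test would fail whenever
        # test_add_one_item ran before it.
        self.assertEqual(SHARED_CART, [])
```

With the reset in place, both tests pass regardless of the order in which they run.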

Detecting the cause of a flaky test is challenging. Software engineering researchers have proposed automated tools to detect flaky tests. If you are curious about such tools and the current state of the art, I suggest that you read the following:

The work of Bell et al. (2018), who proposed DeFlaker, a tool that monitors the coverage of the latest code changes and marks a newly failing test as flaky if it did not exercise any of the changed code.
The work of Lam et al. (2019), who proposed iDFlakies, a tool that executes tests in random order, looking for flakiness.
