Test Failure Analysis

Justin Searls edited this page Aug 14, 2015 · 13 revisions

Whenever a test run fails, the first step in responding is to analyze the nature of the failure. Every failure can be described as either a "true negative" or a "false negative". In summary:

  • True Negatives are failures indicating a bug in the production code and for which the corrective action is a change to the production code
  • False Negatives are failures indicating the test is out-of-step with the intended behavior of the production code and for which the corrective action is a change to the test

Each occurrence of a false negative erodes the perceived usefulness of a given test suite. If the majority of test failures that developers encounter (outside a normal TDD productivity workflow) simply indicate that tests are out-of-date and need to be updated, then a failing build will be less valuable and testing will come to be seen as a needlessly urgent chore. For teams with very robust test suites, it's not uncommon for morale to suffer as true negatives become exceedingly rare in comparison to false negatives. This is especially true of test suites that aren't designed to mitigate redundant coverage.

Often, analyzing whether a failure was a true or false negative is the most time-intensive step of fixing a broken test. Moreover, the analysis cost tends to increase with how integrated the test is, because the causal relationship between the test and the production change that triggered the failure is less clear. Unfortunately, false negatives tend to be much more common in integration tests than in unit tests for the same reason: most developers will know to update a symmetrical unit test but may not be aware of other tests that indirectly depend on the code being changed.

To illustrate this analysis cost, consider a system that calculates mortgage amortization schedules. Naturally, its integration test suite contains assertions based on real-world examples of known inputs and expected schedules. Suppose that a developer uncovers that the system is rounding values too early in some calculation, so they change the affected unit and its corresponding test, and then push the fix. Next, the build breaks and roughly half of the related integration tests are red. The team must now ask, "Which tests are red because their expectations were generated using the faulty rounding logic? Which tests might indicate an actual problem with the change?" For important financial calculations, this test failure analysis stemming from a relatively simple code change might literally take weeks of effort.
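The scenario above can be sketched in a few lines. This is only an illustration under stated assumptions: the `monthlyPayment` function, the loan figures, and the specific bug (rounding the monthly rate to four decimal places before use) are all hypothetical stand-ins for whatever the real system does.

```javascript
// Hypothetical payment calculation; names and figures are illustrative only.
function monthlyPayment(principal, annualRate, months, options) {
  let rate = annualRate / 12;
  if (options.roundEarly) {
    // The assumed bug: the monthly rate is rounded to 4 decimal places too early.
    rate = Math.round(rate * 1e4) / 1e4;
  }
  const payment = (principal * rate) / (1 - Math.pow(1 + rate, -months));
  // Round to cents only once, at the very end.
  return Math.round(payment * 100) / 100;
}

// An integration-style expectation that was captured from the system's own
// output *before* the fix, so it silently encodes the faulty rounding.
const goldenPayment = monthlyPayment(200000, 0.065, 360, { roundEarly: true });

// The same calculation after the rounding fix.
const fixedPayment = monthlyPayment(200000, 0.065, 360, { roundEarly: false });

// The integration assertion now goes red even though the fix is correct.
const integrationTestPasses = fixedPayment === goldenPayment;
console.log({ goldenPayment, fixedPayment, integrationTestPasses });
```

The red result alone does not say which side is wrong. Someone still has to establish whether each expectation came from an independent real-world source (making the failure a possible true negative) or from the system's earlier output (making it a false negative).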

Test Success Analysis

"True positive" and "false positive" analysis is less worthy of lengthy discussion, since developers rarely spend much time studying why passing tests succeed. It's worth noting, though, that "false positives" (that is, tests that pass when the code doesn't actually exhibit the intended behavior) are a common problem in their own right. Jim Weirich called these "fantasy tests", and they're especially common when a test exhibits:

  • Liberal, undisciplined use of test doubles in a Detroit-school test, such that actual instances of the subject won't behave as they do when placed under test
  • Generated test cases (e.g. when tests are defined in the body of a loop over an array or hash of test data) which, ironically, rarely test the generation logic itself (this is very common in JavaScript tests because of a common gotcha in var scoping and loops)
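The `var` scoping gotcha mentioned in the second item can be reproduced with a minimal sketch. The tiny `it`, `runAll`, and `expectEqual` helpers below are hypothetical stand-ins for a real test framework, written out only so the example is self-contained. The first entry in `cases` is deliberately wrong, so an honest suite should report one failure.

```javascript
// Minimal stand-in for a test framework: `it` registers a test to run later.
const registered = [];
function it(name, fn) {
  registered.push({ name, fn });
}

// Runs every registered test and returns the names of the ones that threw.
function runAll() {
  const failures = [];
  for (const test of registered) {
    try {
      test.fn();
    } catch (err) {
      failures.push(test.name);
    }
  }
  registered.length = 0; // reset for the next batch
  return failures;
}

function expectEqual(actual, expected) {
  if (actual !== expected) {
    throw new Error("expected " + expected + ", got " + actual);
  }
}

const double = (n) => n * 2;

const cases = [
  { input: 1, expected: 3 }, // deliberately wrong: this case should fail
  { input: 2, expected: 4 },
  { input: 3, expected: 6 },
];

// Fantasy version: `var` is function-scoped, so every callback shares one
// `testCase` binding, which holds the last element by the time the tests run.
for (var i = 0; i < cases.length; i++) {
  var testCase = cases[i];
  it("doubles " + testCase.input, function () {
    expectEqual(double(testCase.input), testCase.expected);
  });
}
const failuresWithVar = runAll(); // [] (the wrong case is never checked)

// Block-scoped version: each iteration gets its own `c` binding.
for (const c of cases) {
  it("doubles " + c.input, function () {
    expectEqual(double(c.input), c.expected);
  });
}
const failuresWithConst = runAll(); // ["doubles 1"]

console.log({ failuresWithVar, failuresWithConst });
```

The test names in the `var` version still look correct, because they are built when each test is registered. Only the test bodies run later, after the loop has finished, which is what makes this kind of false positive easy to miss.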