Test Analytics – Measuring Automated Test Coverage – Part 2: The Mutants
20/09/2016
Unlike manual testing, automated test suites will, by their nature, only detect defects within the bounds of the validations they have been set up to perform.
Measuring code coverage on your automated tests is great, but it is only part of the picture. Code coverage only tells you which logic and branches have been executed; it doesn't measure how much data coverage your tests achieve, and it doesn't tell you whether your tests are effectively detecting failures. As we discussed in our posts on API Banking and Agile Test Data, test data is an area which is all too often overlooked.
One technique that starts to give you better insight into how comprehensively your test data inputs and outputs exercise the software under test is mutation testing.
Validation of the system response (effectively the expected results) is a critical part of implementing automated testing. It is straightforward to check that a single variable is sensible, but as interfaces get more complex (think of an XML message or a user interface), the amount of design subjectivity around the validations increases.
Of course, it is hard to determine whether you can detect failures until the failures actually occur – something familiar to every test manager as they take the learning from production releases and feed it back into the test process. Surely there is a better way to find problems with test coverage?
Well, there is. Mutation testing can help you review your test automation and understand how good your coverage really is. Mutation testing is the process of running the automated tests many times, each time against a slightly different version of the software under test. Each version contains a defect injected based on a heuristic – hence they are called Mutants – making each one a slightly altered copy of the original software. In Piccadilly Labs we've been using Pitest, a great tool for mutation testing of Java software.
If your tests fail, the Mutant is "Killed"; otherwise it is a "Survivor" and, more importantly, an opportunity to improve your tests.
An example of a mutation is injecting an "off by one" error, or reversing a multiplication into a division – both common programming errors.
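To make that concrete, here is a minimal sketch (not the Exhauster code – the method and values are invented for illustration) showing how a maths mutant lives or dies depending on your test data:

```java
public class MutantDemo {

    // Original production code: total price is unit price times quantity.
    static int totalPrice(int unitPrice, int quantity) {
        return unitPrice * quantity;
    }

    // What a maths mutator produces: '*' reversed to '/'.
    static int totalPriceMutant(int unitPrice, int quantity) {
        return unitPrice / quantity;
    }

    public static void main(String[] args) {
        // With quantity 1, both versions return 5, so a test using this
        // data cannot tell them apart - the mutant survives.
        System.out.println(totalPrice(5, 1) == totalPriceMutant(5, 1)); // true

        // With quantity 2, the original returns 10 and the mutant 2, so
        // an assertion on the result kills the mutant.
        System.out.println(totalPrice(5, 2) == totalPriceMutant(5, 2)); // false
    }
}
```

Note that line coverage is identical in both cases; it is the choice of data and the strength of the assertion that decides whether the mutant is caught.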
Mutation testing is rather resource-intensive, so we use our build server to run jobs overnight. We've got Jenkins and Sonar set up with an array of test tools on a t2.medium EC2 server (a medium-sized server!). Every time we check changes into source control, Jenkins runs the application we were working on in Part 1 through both a standard test run and a mutation test run. A mutation testing pass on the Exhauster app takes about 3-4 minutes – quite a long time in the world of Continuous Integration.
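Our exact Jenkins job definition isn't shown here, but with the pitest-maven plugin declared in the project's pom.xml, the mutation run that Jenkins triggers boils down to a goal like this (a sketch – your module layout and plugin configuration may differ):

```shell
# Run the unit tests, then the Pitest mutation coverage goal.
# The HTML report lands under target/pit-reports by default.
mvn test org.pitest:pitest-maven:mutationCoverage
```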
The results are pretty concerning: only 39% of the injected defects were detected by the tests. That leaves over 40 surviving mutations to investigate.
The results immediately drew attention to one method – getMemoryUsage() – which accounts for a quarter of the surviving mutants. It has 100% of its lines covered by automated tests; both the unit and Cucumber tests exercise it.
Let's take a look at the results of running mutation testing against those tests. You can see that line 137 (the really long one building a string) is red. The numbers to the left of the line show that 12 mutations were attempted on this line and 10 of them survived! That is pretty weak coverage.
So what's going on? Well, there are 12 operators (e.g. divide, subtract) in that line of code, either doing maths or joining the string together. Each of the twelve mutants changed ONE of those operators to a different one – multiply instead of divide, and so on. Only 2 of these changes were detected by the tests.
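We haven't reproduced the Exhauster source here, but a hypothetical method in the same shape as getMemoryUsage() shows how a single line can offer a dozen mutation targets – every arithmetic operator and every concatenation below is one:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class MemoryReport {

    // A stand-in for the kind of line under discussion: one long string
    // build mixing maths and concatenation. Each '/', '*' and '+' is a
    // separate target for Pitest's mutators.
    static String describe(long usedBytes, long maxBytes) {
        return "Heap: " + (usedBytes / 1024 / 1024) + "MB of "
                + (maxBytes / 1024 / 1024) + "MB ("
                + (usedBytes * 100 / maxBytes) + "% used)";
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println(describe(heap.getUsed(), heap.getMax()));
    }
}
```

A test that only inspects the start of that string exercises every operator on the line – full line coverage – while verifying almost none of them.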
Let's take a look at why. First, the unit test – it clearly doesn't do much. It really just checks that the response starts with "Heap". As we've seen, there is a lot more to check!
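The test itself isn't reproduced here, but it amounted to something like this (a reconstruction – the class and method names are invented):

```java
public class WeakAssertionDemo {

    // Stand-in for the method under test.
    static String getMemoryUsage(long usedMb, long maxMb) {
        return "Heap: " + usedMb + "MB of " + maxMb + "MB";
    }

    public static void main(String[] args) {
        String output = getMemoryUsage(10, 64);

        // The weak check: this passes for ANY mutant whose output still
        // begins with "Heap", so mutations of the maths and of the rest
        // of the string all survive.
        if (!output.startsWith("Heap")) {
            throw new AssertionError("Unexpected output: " + output);
        }
        System.out.println("weak assertion passed");
    }
}
```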
That explains what we need to fix at the unit level, but what about the end to end functional test? That covers memory usage as we saw earlier:
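The feature file isn't shown here, but the end-to-end check was along these lines (a Gherkin reconstruction – the step wording is invented):

```gherkin
Feature: Exhauster monitoring

  Scenario: Memory usage is reported
    When I request the memory usage
    Then the response should contain "Heap"
```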
Again, only part of the response is being verified. Let’s take a look at the test code invoked by the above tests.
There is a rather complex line of code doing concatenations and calculations, and it is barely covered at all. The static text isn't checked for presence, and the calculations aren't verified. Killing 10 of the 12 mutants required 50 lines of test code at the first pass. That seems a fairly expensive trade-off, and it might be more effective to refactor the code under test!
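To give a flavour of what killing those mutants takes, the assertions have to pin down the static text and recompute the expected values, rather than pattern-match loosely – a sketch (invented names again):

```java
public class StrongAssertionDemo {

    // Stand-in for the method under test.
    static String getMemoryUsage(long usedBytes, long maxBytes) {
        return "Heap: " + (usedBytes / 1024 / 1024) + "MB of "
                + (maxBytes / 1024 / 1024) + "MB";
    }

    public static void main(String[] args) {
        long used = 20971520L;  // 20MB in bytes
        long max = 67108864L;   // 64MB in bytes

        // Pinning the whole string verifies the static text AND the
        // arithmetic, so a mutant that flips '/' to '*' or drops a
        // concatenation now produces a different string and is killed.
        String expected = "Heap: 20MB of 64MB";
        if (!expected.equals(getMemoryUsage(used, max))) {
            throw new AssertionError("Mutant detected!");
        }
        System.out.println("strong assertion passed");
    }
}
```

Multiply that by every number and every fragment of static text in the real line and the 50 lines of test code are easy to believe.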
It is important to apply a value test. The 50 lines of code we wrote were testing a complicated string that is only sent to stdout for logging. In the real world that effort does not make sense unless the log is super critical. To reduce the test code, we could decide that calls to all forms of logging are not mutated, since errors there would not impact core functionality. Pitest contains configuration for just that type of scenario:
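In the Maven plugin that looks something like the following (a sketch – note that recent Pitest versions already avoid calls into the common logging frameworks by default, so check the defaults for your version):

```xml
<plugin>
  <groupId>org.pitest</groupId>
  <artifactId>pitest-maven</artifactId>
  <configuration>
    <!-- Lines of code that call into these packages will not be mutated. -->
    <avoidCallsTo>
      <avoidCallsTo>java.util.logging</avoidCallsTo>
      <avoidCallsTo>org.slf4j</avoidCallsTo>
      <avoidCallsTo>org.apache.log4j</avoidCallsTo>
    </avoidCallsTo>
  </configuration>
</plugin>
```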
This helps to narrow our focus because, as you have seen above, it is easy to go down rabbit holes when trying to fix your coverage. We recommend that you set clear principles around your tests – such as whether you want to cover exceptions and logging – and that you put a business lens on your priorities before you start chasing mutants.
Adam Smith – Director