Mutation testing is widely used in experiments. Some papers experiment with mutation directly, while others use it to introduce faults in order to measure the effectiveness of tests created by other methods. The mutation score varies somewhat at random, depending on the specific test values used. When generating tests for experiments, a common, although not universal, practice is to generate multiple sets of tests that satisfy the same criterion or follow the same procedure, and then to compute their average performance. Averaging over multiple test sets is thought to reduce the variation in the mutation score. This practice is extremely expensive when tests are generated by hand (as is common) and as the number of programs increases (a current positive trend in software engineering experimentation). The research reported in this short paper asks a simple and direct question: do we need to generate multiple sets of test cases? That is, how do different test sets influence the cost and effectiveness results? In a controlled experiment, we generated 10 different test sets, each adequate for the Statement Deletion (SSDL) mutation operator, for 39 small programs and functions, and then evaluated how they differ in terms of cost and effectiveness. We found that averaging over multiple programs was effective in reducing the variance in the mutation scores introduced by specific tests.