- When: Monday, March 28, 2022 from 11:00 AM to 12:00 PM
- Speaker: Marcelo d'Amorim
- Location: Zoom only
Abstract: In this talk, I will present my work on taming nondeterminism in software artifacts, such as test cases and scripts that automate software deployments. Let us focus on tests for a moment. Imagine that you are a software engineer and that you have just pushed code changes to a remote repository shared with multiple developers. Half an hour later, the integration service notifies you that two tests failed during the last execution. You are confused, as your changes look simple. You start inspecting your changes, looking for something that could explain those failures. One hour later, you receive another notification from the integration service. This time, the server indicates that all tests have passed. You begin to suspect that the problem was not with the code you changed but, instead, with the non-deterministic behavior of those two tests. After investing some time debugging, you confirm that the two tests are flaky. A flaky test is one that non-deterministically passes or fails in a fixed environment (same machine, OS, etc.). Test flakiness is costly, as developers spend precious time chasing a problem in the application that may not exist. It is also a serious problem in industry (e.g., at Google, Facebook, Mozilla, Twitter, and Microsoft).
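To make the notion concrete, here is a minimal, hypothetical example (not taken from the talk or its benchmark) of a test that is flaky because of concurrency, written against JUnit 4; the `Counter` class and the sleep duration are purely illustrative.

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Hypothetical flaky test: the assertion races with a background thread,
// so the outcome depends on thread scheduling rather than on the code under test.
public class CounterTest {

    static class Counter {
        private volatile int value = 0;

        // The update happens asynchronously, e.g., simulating an event callback.
        void incrementAsync() {
            new Thread(() -> value++).start();
        }

        int get() {
            return value;
        }
    }

    @Test
    public void incrementIsObserved() throws Exception {
        Counter counter = new Counter();
        counter.incrementAsync();
        Thread.sleep(10); // "usually enough" time -- the root cause of the flakiness
        // Passes on an idle machine, but may fail when the scheduler delays
        // the background thread (e.g., on a loaded CI server).
        assertEquals(1, counter.get());
    }
}
```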
Prior studies have shown that concurrent behavior is the most common cause of test flakiness. Based on that observation, we hypothesize that adding noise to the environment can interfere with the ordering of program events and, consequently, can influence test outcomes. Shaker is a practical technique that detects flakiness by comparing the outputs of tests executed in carefully selected "noisy" environments. Compared with a regular test run, a test run in Shaker is slower because Shaker executes the tests in loaded environments, i.e., the process that runs a test competes for resources (e.g., CPU or memory) with stressor tasks that Shaker creates. However, we conjecture that Shaker pays off by detecting test flakiness in fewer runs than the alternative of running the test suite multiple times in a default environment without noise. We refer to that alternative as ReRun.
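The sketch below illustrates the core idea as described in the abstract, not Shaker's actual implementation: run a test repeatedly while stressor threads compete for CPU, and flag the test if its outcome varies across runs. The class and method names are my own, and the noise here is CPU-only, whereas Shaker targets Android test suites and can also stress memory.

```java
import java.util.concurrent.*;
import java.util.function.Supplier;

// Simplified illustration of noise-based flakiness detection (not the Shaker tool itself):
// execute a test body several times while stressor threads compete for CPU, and report
// the test as likely flaky if it both passes and fails across those runs.
public class NoisyRunner {

    // Start 'n' busy-loop threads that compete with the test for CPU time.
    static ExecutorService startCpuStressors(int n) {
        ExecutorService pool = Executors.newFixedThreadPool(n);
        for (int i = 0; i < n; i++) {
            pool.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    Math.sqrt(ThreadLocalRandom.current().nextDouble()); // burn CPU
                }
            });
        }
        return pool;
    }

    // Returns true if the test produced both a pass and a fail across the noisy runs.
    static boolean looksFlaky(Supplier<Boolean> test, int runs, int stressors) {
        ExecutorService pool = startCpuStressors(stressors);
        try {
            boolean sawPass = false, sawFail = false;
            for (int i = 0; i < runs; i++) {
                if (test.get()) sawPass = true; else sawFail = true;
            }
            return sawPass && sawFail;
        } finally {
            pool.shutdownNow(); // stop the stressor threads
        }
    }
}
```

A caller might invoke `NoisyRunner.looksFlaky(() -> runSuspectTest(), 5, 4)` to compare five noisy runs, where `runSuspectTest` is a hypothetical method that executes the test and returns whether it passed.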
We evaluated Shaker on a public benchmark of flaky tests for Android applications, using standard performance metrics (e.g., precision and recall) and ReRun as the comparison baseline. The results are encouraging. For example, we found that (1) Shaker is 98% precise, almost as precise as ReRun, which, by definition, does not report false positives; (2) Shaker's recall is much higher than ReRun's (95% versus 65%); and (3) Shaker detects flaky tests much more efficiently than ReRun, despite the execution overhead associated with introducing noise.
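For readers unfamiliar with the metrics, the standard definitions below (my addition, using the usual formulation rather than anything specific to the paper's evaluation protocol) show why ReRun cannot report false positives: it only labels a test flaky after observing both a pass and a fail in the default environment, so every test it reports is genuinely flaky.

```latex
% A reported test is a "positive"; a reported test that is genuinely flaky is a true positive (TP).
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}
\]
% ReRun sees both a pass and a fail before reporting, so FP = 0 and its precision is 1 by definition;
% its lower recall reflects flaky tests whose outcome never happens to change within the allotted reruns.
```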
In the future, I plan to evaluate other mechanisms for introducing noise into the environment (e.g., resource throttling, test-specific noise generators) and to explore the idea of selectively introducing noise to debug flaky tests (i.e., to explain to the developer why a test is flaky). Shaker paved the way for those ideas.
Short bio: Marcelo d'Amorim is an Associate Professor at the Federal University of Pernambuco (UFPE), Brazil. He obtained his PhD from the University of Illinois at Urbana-Champaign in 2007 and his MS and BS degrees from UFPE in 2001 and 1997, respectively. Marcelo's research goal is to help developers build correct software faster.