Title: An Empirical Study of Flaky Tests in Python
Authors: Gruber, Martin; Lukasczyk, Stephan; Kroiß, Florian; Fraser, Gordon
Editors: Grunske, Lars; Siegmund, Janet; Vogelsang, Andreas
Date: 2022 (record date: 2022-01-19)
ISBN: 978-3-88579-714-2
ISSN: 1617-5468
DOI: 10.18420/se2022-ws-009
URL: https://dl.gi.de/handle/20.500.12116/37998
Language: en
Keywords: Flaky Test; Python; Empirical Study
Type: Text/Conference Paper

Abstract: This is a summary of our work presented at the International Conference on Software Testing 2021 [Gr21b]. Tests that cause spurious failures without code changes, i.e., flaky tests, hamper regression testing and decrease trust in tests. While the prevalence and importance of flakiness are well established, prior research has focused on Java projects, raising questions about generalizability. To provide a better understanding of flakiness, we empirically study the prevalence, causes, and degree of flakiness within 22 352 Python projects containing 876 186 tests. We found flakiness to be equally prevalent in Python as in Java. The reasons, however, differ: order dependency is a dominant problem, causing 59% of the 7 571 flaky tests we found. Another 28% were caused by test infrastructure problems, a previously less considered cause of flakiness. The remaining 13% can mostly be attributed to the use of network and randomness APIs. Unveiling flaky tests also requires more runs than often assumed: achieving 95% confidence that a passing test is not flaky would, on average, require 170 reruns. Through our investigations, we additionally created a large dataset of flaky tests that other researchers have already started building on [MM21; Ni21].
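
The "170 reruns" figure in the abstract can be illustrated with a minimal sketch of the underlying arithmetic, assuming a simple model in which each run of a flaky test fails independently with some probability `p_fail` (the paper's exact estimation method may differ; the 1.75% rate below is a back-of-the-envelope value chosen only to reproduce the quoted number, not a figure from the study):

```python
import math

def reruns_for_confidence(p_fail: float, confidence: float = 0.95) -> int:
    """Smallest number of reruns n such that a flaky test failing
    independently with per-run probability p_fail would fail at least
    once with the given confidence: 1 - (1 - p_fail)**n >= confidence.
    """
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_fail))

# With a per-run failure probability of roughly 1.75%, about 170 reruns
# are needed before a clean pass streak gives 95% confidence.
print(reruns_for_confidence(0.0175))  # → 170
```

Under this model, the required rerun count grows sharply as the per-run failure probability shrinks, which is why rarely-failing flaky tests are so expensive to detect by rerunning alone.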