Abstract:Background: Fairness testing for deep learning systems has been becoming increasingly important. However, much work assumes perfect context and conditions from the other parts: well-tuned hyperparameters for accuracy; rectified bias in data, and mitigated bias in the labeling. Yet, these are often difficult to achieve in practice due to their resource-/labour-intensive nature. Aims: In this paper, we aim to understand how varying contexts affect fairness testing outcomes. Method:We conduct an extensive empirical study, which covers $10,800$ cases, to investigate how contexts can change the fairness testing result at the model level against the existing assumptions. We also study why the outcomes were observed from the lens of correlation/fitness landscape analysis. Results: Our results show that different context types and settings generally lead to a significant impact on the testing, which is mainly caused by the shifts of the fitness landscape under varying contexts. Conclusions: Our findings provide key insights for practitioners to evaluate the test generators and hint at future research directions.