Anomaly detection methods have demonstrated remarkable success across various applications. However, assessing their performance, particularly at the pixel-level, presents a complex challenge due to the severe imbalance that is most commonly present between normal and abnormal samples. Commonly adopted evaluation metrics designed for pixel-level detection may not effectively capture the nuanced performance variations arising from this class imbalance. In this paper, we dissect the intricacies of this challenge, underscored by visual evidence and statistical analysis, leading to delve into the need for evaluation metrics that account for the imbalance. We offer insights into more accurate metrics, using eleven leading contemporary anomaly detection methods on twenty-one anomaly detection problems. Overall, from this extensive experimental evaluation, we can conclude that Precision-Recall-based metrics can better capture relative method performance, making them more suitable for the task.