Abstract:Ensuring reliable data collection in large-scale particle physics experiments demands Data Quality Monitoring (DQM) procedures to detect possible detector malfunctions and preserve data integrity. Traditionally, this resource-intensive task has been handled by human shifters that struggle with frequent changes in operational conditions. We present novel, interpretable, robust, and scalable DQM algorithms designed to automate anomaly detection in time-dependent settings. Our approach constructs evolving histogram templates with built-in uncertainties, featuring both a statistical variant - extending the classical Exponentially Weighted Moving Average (EWMA) - and a machine learning (ML)-enhanced version that leverages a transformer encoder for improved adaptability. Experimental validations on synthetic datasets demonstrate the high accuracy, adaptability, and interpretability of these methods, with the statistical variant being commissioned in the LHCb experiment at the Large Hadron Collider, underscoring its real-world impact. The code used in this study is available at https://github.com/ArseniiGav/DINAMO.
Abstract:Data Quality Monitoring (DQM) is a crucial task in large particle physics experiments, since detector malfunctioning can compromise the data. DQM is currently performed by human shifters, which is costly and results in limited accuracy. In this work, we provide a proof-of-concept for applying human-in-the-loop Reinforcement Learning (RL) to automate the DQM process while adapting to operating conditions that change over time. We implement a prototype based on the Proximal Policy Optimization (PPO) algorithm and validate it on a simplified synthetic dataset. We demonstrate how a multi-agent system can be trained for continuous automated monitoring during data collection, with human intervention actively requested only when relevant. We show that random, unbiased noise in human classification can be reduced, leading to an improved accuracy over the baseline. Additionally, we propose data augmentation techniques to deal with scarce data and to accelerate the learning process. Finally, we discuss further steps needed to implement the approach in the real world, including protocols for periodic control of the algorithm's outputs.