Abstract:Automatic pain intensity estimation plays a pivotal role in healthcare and medical fields. While many methods have been developed to gauge human pain using behavioral or physiological indicators, facial expressions have emerged as a prominent tool for this purpose. Nevertheless, the dependence on labeled data for these techniques often renders them expensive and time-consuming. To tackle this, we introduce the Adaptive Hierarchical Spatio-temporal Dynamic Image (AHDI) technique. AHDI encodes spatiotemporal changes in facial videos into a singular RGB image, permitting the application of simpler 2D deep models for video representation. Within this framework, we employ a residual network to derive generalized facial representations. These representations are optimized for two tasks: estimating pain intensity and differentiating between genuine and simulated pain expressions. For the former, a regression model is trained using the extracted representations, while for the latter, a binary classifier identifies genuine versus feigned pain displays. Testing our method on two widely-used pain datasets, we observed encouraging results for both tasks. On the UNBC database, we achieved an MSE of 0.27 outperforming the SOTA which had an MSE of 0.40. On the BioVid dataset, our model achieved an accuracy of 89.76%, which is an improvement of 5.37% over the SOTA accuracy. Most notably, for distinguishing genuine from simulated pain, our accuracy stands at 94.03%, marking a substantial improvement of 8.98%. Our methodology not only minimizes the need for extensive labeled data but also augments the precision of pain evaluations, facilitating superior pain management.