This study explores using Natural Language Processing (NLP) to analyze candidate comments and identify problematic test items. We developed and validated machine learning models that automatically identify relevant negative feedback, evaluated whether incorporating psychometric features enhances model performance, and compared NLP-flagged items with traditionally flagged items. Results demonstrate that candidate feedback provides valuable information that complements statistical flagging methods, potentially improving test validity while reducing manual review burden. This research offers testing organizations an efficient mechanism for incorporating direct candidate experience into quality assurance processes.
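To make the general approach concrete, the sketch below shows one plausible way to combine comment text with psychometric item statistics in a single classifier. It is a minimal illustration only: the column names (`comment`, `p_value`, `point_biserial`, `is_relevant_negative`), the toy data, and the choice of TF-IDF plus logistic regression are assumptions for demonstration, not the models or features used in the study.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: candidate comments paired with classical item
# statistics. Columns and labels are illustrative, not taken from the study.
df = pd.DataFrame({
    "comment": [
        "Two answer options seem correct for this question.",
        "Great exam overall, thank you.",
        "The stem is ambiguous and the key looks wrong.",
        "No issues, the wording was clear.",
    ],
    "p_value": [0.35, 0.80, 0.28, 0.75],          # classical item difficulty
    "point_biserial": [0.05, 0.40, 0.02, 0.35],   # item discrimination
    "is_relevant_negative": [1, 0, 1, 0],         # target: relevant negative feedback
})

X = df[["comment", "p_value", "point_biserial"]]
y = df["is_relevant_negative"]

# Combine TF-IDF text features with scaled psychometric features.
features = ColumnTransformer([
    ("text", TfidfVectorizer(ngram_range=(1, 2), min_df=1), "comment"),
    ("psychometric", StandardScaler(), ["p_value", "point_biserial"]),
])

model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)
model.fit(X_train, y_train)
print(model.predict(X_test))  # predicted flags for held-out comments
```

In a design like this, items whose comments are predicted as relevant negative feedback could be routed to the same review queue used for statistically flagged items, allowing direct comparison between the two flagging sources.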