Abstract: Artificial intelligence-based language generators are now a part of most people's lives. However, by default, they tend to generate "average" language without reflecting the ways in which people differ. Here, we propose a lightweight modification to the standard language model transformer architecture - "PsychAdapter" - that uses empirically derived trait-language patterns to generate natural language for specified personality, demographic, and mental health characteristics (with or without prompting). We applied PsychAdapters to modify OpenAI's GPT-2, Google's Gemma, and Meta's Llama 3 and found the generated text to reflect the desired traits. For example, expert raters evaluated PsychAdapter's generated text output and found it matched intended trait levels with 87.3% average accuracy for Big Five personality traits, and 96.7% for depression and life satisfaction. PsychAdapter is a novel method to introduce psychological behavior patterns into language models at the foundation level, independent of prompting, by influencing every transformer layer. This approach can create chatbots with specific personality profiles, clinical training tools that mirror language associated with psychological conditions, and machine translations that match an author's reading or education level without taking up LLM context windows. PsychAdapter also allows for the exploration of psychological constructs through natural language expression, extending the natural language processing toolkit to study human psychology.
Abstract: Use of large language models such as ChatGPT (GPT-4) for mental health support has grown rapidly, emerging as a promising route to assess and help people with mood disorders, like depression. However, we have a limited understanding of GPT-4's schema of mental disorders, that is, how it internally associates and interprets symptoms. In this work, we leveraged contemporary measurement theory to decode how GPT-4 interrelates depressive symptoms to inform both clinical utility and theoretical understanding. We found that GPT-4's assessment of depression: (a) had high overall convergent validity (r = .71 with self-report on 955 samples, and r = .81 with expert judgments on 209 samples); (b) had moderately high internal consistency (symptom inter-correlations r = .23 to .78) that largely aligned with the literature and self-report, except that GPT-4 (c) underemphasized the relationship of suicidality, and overemphasized that of psychomotor symptoms, with other symptoms; and (d) had symptom inference patterns that suggest nuanced hypotheses (e.g., sleep and fatigue are influenced by most other symptoms, while feelings of worthlessness/guilt are mostly influenced by depressed mood).
Abstract: Mental health issues vary widely across individuals - the manifestations of signs and symptoms can be fairly heterogeneous. Recently, language-based depression and anxiety assessments have shown promise for capturing this heterogeneous nature by evaluating a patient's own language, but such approaches require a large sample of words per person to be accurate. In this work, we introduce adaptive language-based assessment - the task of iteratively estimating an individual's psychological score based on limited language responses to questions that the model also decides to ask. To this end, we explore two statistical learning-based approaches for measurement/scoring: classical test theory (CTT) and item response theory (IRT). We find that adaptive testing in general can significantly reduce the number of questions required to achieve high validity (r ~ 0.7) with standardized tests, bringing the number down from 11 total questions to 3 for depression and 5 for anxiety. Given the combinatorial nature of the problem, we empirically evaluate multiple strategies for both the ordering and scoring objectives, introducing two new methods: a semi-supervised item response theory-based method (ALIRT), and a supervised actor-critic-based model. While both models achieve significant improvements over random and fixed orderings, we find ALIRT to be a scalable model that achieves the highest accuracy with lower numbers of questions (e.g., it achieves Pearson r ~ 0.93 after only 3 questions versus asking all 11 questions). Overall, ALIRT allows prompting with a reduced number of questions without compromising accuracy or adding computational overhead.
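To make the idea of adaptive assessment concrete, the following is a minimal sketch of generic IRT-based adaptive testing, not the ALIRT method itself. It assumes a two-parameter logistic (2PL) model with binary responses, whereas the abstract describes free-text language responses; the item parameters, the function names (`p_endorse`, `fisher_information`, `estimate_theta`, `adaptive_test`), and the grid-based posterior-mean scoring are all illustrative assumptions. At each step the loop asks the unasked question with the highest Fisher information at the current score estimate, then re-estimates the score.

```python
import math

def p_endorse(theta, a, b):
    """2PL probability of a positive response (a: discrimination, b: difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Information a 2PL item carries about theta at the given theta."""
    p = p_endorse(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate_theta(responses, items, grid):
    """Posterior-mean estimate of theta on a grid, with a standard normal prior."""
    weights = []
    for theta in grid:
        log_w = -0.5 * theta * theta  # log prior, up to a constant
        for idx, y in responses:
            a, b = items[idx]
            p = p_endorse(theta, a, b)
            log_w += math.log(p if y else 1.0 - p)
        weights.append(math.exp(log_w))
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total

def adaptive_test(items, answer_fn, n_questions):
    """Ask up to n_questions items, each time picking the most informative one.

    items: list of hypothetical (a, b) parameter pairs.
    answer_fn: callable mapping an item index to a binary response.
    Returns the final theta estimate and the order in which items were asked.
    """
    grid = [g / 10.0 for g in range(-40, 41)]
    theta, responses, asked = 0.0, [], []
    remaining = set(range(len(items)))
    for _ in range(min(n_questions, len(items))):
        nxt = max(sorted(remaining),
                  key=lambda i: fisher_information(theta, *items[i]))
        remaining.remove(nxt)
        responses.append((nxt, answer_fn(nxt)))
        asked.append(nxt)
        theta = estimate_theta(responses, items, grid)
    return theta, asked

# Hypothetical item bank: (discrimination, difficulty) pairs.
ITEMS = [(1.0, -1.0), (1.5, 0.0), (0.8, 1.0), (2.0, 0.5)]
theta_hat, order = adaptive_test(ITEMS, lambda i: True, n_questions=3)
```

With this kind of loop, the question order differs per respondent because each answer shifts the score estimate, which in turn changes which remaining question is most informative. A language-based variant would replace the binary response with a model-derived score for each free-text answer.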