Abstract:Foundation language model deployments often include auxiliary guard-rail models to filter or classify text, detecting jailbreak attempts, biased or toxic output, or ensuring topic adherence. These additional models increase the complexity and cost of model inference, especially since many are also large language models. To address this issue, we explore training-free model merging techniques to consolidate these models into a single, multi-functional model. We propose Heterogeneous Multi-Class Model Merging (HM3) as a simple technique for merging multi-class classifiers with heterogeneous label spaces. Unlike parameter-efficient fine-tuning techniques like LoRA, which require extensive training and add complexity during inference, recent advancements allow models to be merged in a training-free manner. We report promising results for merging BERT-based guard models, some of which attain an average F1-score higher than the source models while reducing the inference time by up to 44%. We introduce self-merging to assess the impact of reduced task-vector density, finding that the more poorly performing hate speech classifier benefits from self-merging while higher-performing classifiers do not, which raises questions about using task vector reduction for model tuning.
Abstract:The emergence of large language models (LLMs) has revolutionized numerous applications across industries. However, their "black box" nature often hinders the understanding of how they make specific decisions, raising concerns about their transparency, reliability, and ethical use. This study presents a method to improve the explainability of LLMs by varying individual words in prompts to uncover their statistical impact on the model outputs. This approach, inspired by permutation importance for tabular data, masks each word in the system prompt and evaluates its effect on the outputs based on the available text scores aggregated over multiple user inputs. Unlike classical attention, word importance measures the impact of prompt words on arbitrarily-defined text scores, which enables decomposing the importance of words into the specific measures of interest--including bias, reading level, verbosity, etc. This procedure also enables measuring impact when attention weights are not available. To test the fidelity of this approach, we explore the effect of adding different suffixes to multiple different system prompts and comparing subsequent generations with different large language models. Results show that word importance scores are closely related to the expected suffix importances for multiple scoring functions.