Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Thomas Wolfers

Beyond Benchmarks: A Novel Framework for Domain-Specific LLM Evaluation and Knowledge Mapping

Jun 09, 2025

Nitin Sharma, Thomas Wolfers, Çağatay Yıldız

Abstract:The paper addresses two critical challenges in language model (LM) evaluation: creating reliable domain-specific benchmarks and understanding knowledge representation during domain adaptation. We introduce a deterministic pipeline that converts raw domain corpora into completion-type benchmarks without relying on LMs or human curation, eliminating benchmark contamination issues while enabling evaluation on the latest domain data. Our approach generates domain-specific keywords and related word lists using TF and Term TF-IDF methods and constructs prompt-target pairs. We evaluate models by measuring their ability to complete these prompts with the correct domain-specific targets, providing a direct assessment of domain knowledge with low computational cost. Through comprehensive experiments across multiple models (GPT-2 medium/XL, Llama-2/3.1, OLMo-2, Qwen-2, Mistral) and domains, we demonstrate that our benchmark strongly correlates with expert-generated benchmarks while providing a more accurate measure of domain knowledge than traditional perplexity metrics. We reveal that domain adaptation happens rapidly in smaller models (within 500 steps) and illustrate a new approach to domain knowledge evaluation in base models during training for early stopping. By extending mechanistic analysis to domain adaptation, we discover that initial-to-mid layers are primarily responsible for attribute extraction, while later layers focus on next token prediction. Furthermore, we show that during adaptation, forgetting begins in the middle layers, where attribute extraction happens and is amplified in later layers. Our work provides both a practical evaluation methodology for domain-specific LMs and novel insights into knowledge representation during adaptation, with implications for more efficient fine-tuning strategies and targeted approaches to mitigate catastrophic forgetting.

* 35 pages, 24 figures. First submission

Via

Access Paper or Ask Questions

Self-Supervised Masked Mesh Learning for Unsupervised Anomaly Detection on 3D Cortical Surfaces

Dec 07, 2024

Hao-Chun Yang, Sicheng Dai, Saige Rutherford, Christian Gaser Andre F Marquand, Christian F Beckmann, Thomas Wolfers

Figure 1 for Self-Supervised Masked Mesh Learning for Unsupervised Anomaly Detection on 3D Cortical Surfaces

Figure 2 for Self-Supervised Masked Mesh Learning for Unsupervised Anomaly Detection on 3D Cortical Surfaces

Figure 3 for Self-Supervised Masked Mesh Learning for Unsupervised Anomaly Detection on 3D Cortical Surfaces

Abstract:Unsupervised anomaly detection in brain imaging is challenging. In this paper, we propose a self-supervised masked mesh learning for unsupervised anomaly detection in 3D cortical surfaces. Our framework leverages the intrinsic geometry of the cortical surface to learn a self-supervised representation that captures the underlying structure of the brain. We introduce a masked mesh convolutional neural network (MMN) that learns to predict masked regions of the cortical surface. By training the MMN on a large dataset of healthy subjects, we learn a representation that captures the normal variation in the cortical surface. We then use this representation to detect anomalies in unseen individuals by calculating anomaly scores based on the reconstruction error of the MMN. We evaluate our framework by training on population-scale dataset UKB and HCP-Aging and testing on two datasets of Alzheimer's disease patients ADNI and OASIS3. Our results show that our framework can detect anomalies in cortical thickness, cortical volume, and cortical sulcus features, which are known to be sensitive biomarkers for Alzheimer's disease. Our proposed framework provides a promising approach for unsupervised anomaly detection based on normative variation of cortical features.

Via

Access Paper or Ask Questions

To which reference class do you belong? Measuring racial fairness of reference classes with normative modeling

Jul 26, 2024

Saige Rutherford, Thomas Wolfers, Charlotte Fraza, Nathaniel G. Harrnet, Christian F. Beckmann, Henricus G. Ruhe, Andre F. Marquand

Figure 1 for To which reference class do you belong? Measuring racial fairness of reference classes with normative modeling

Figure 2 for To which reference class do you belong? Measuring racial fairness of reference classes with normative modeling

Figure 3 for To which reference class do you belong? Measuring racial fairness of reference classes with normative modeling

Figure 4 for To which reference class do you belong? Measuring racial fairness of reference classes with normative modeling

Abstract:Reference classes in healthcare establish healthy norms, such as pediatric growth charts of height and weight, and are used to chart deviations from these norms which represent potential clinical risk. How the demographics of the reference class influence clinical interpretation of deviations is unknown. Using normative modeling, a method for building reference classes, we evaluate the fairness (racial bias) in reference models of structural brain images that are widely used in psychiatry and neurology. We test whether including race in the model creates fairer models. We predict self-reported race using the deviation scores from three different reference class normative models, to better understand bias in an integrated, multivariate sense. Across all of these tasks, we uncover racial disparities that are not easily addressed with existing data or commonly used modeling techniques. Our work suggests that deviations from the norm could be due to demographic mismatch with the reference class, and assigning clinical meaning to these deviations should be done with caution. Our approach also suggests that acquiring more representative samples is an urgent research priority.

Via

Access Paper or Ask Questions

Learning Cortical Anomaly through Masked Encoding for Unsupervised Heterogeneity Mapping

Dec 11, 2023

Hao-Chun Yang, Ole Andreassen, Lars Tjelta Westlye, Andre F. Marquand, Christian F. Beckmann, Thomas Wolfers

Abstract:The detection of heterogeneous mental disorders based on brain readouts remains challenging due to the complexity of symptoms and the absence of reliable biomarkers. This paper introduces CAM (Cortical Anomaly Detection through Masked Image Modeling), a novel self-supervised framework designed for the unsupervised detection of complex brain disorders using cortical surface features. We employ this framework for the detection of individuals on the psychotic spectrum and demonstrate its capabilities compared to state-ofthe-art methods, achieving an AUC of 0.696 for Schizoaffective and 0.769 for Schizophreniform, without the need for any labels. Furthermore, the analysis of atypical cortical regions includes Pars Triangularis and several frontal areas, often implicated in schizophrenia, provide further confidence in our approach. Altogether, we demonstrate a scalable approach for anomaly detection of complex brain disorders based on cortical abnormalities.

* 5 pages, LaTeX; typos corrected, introduction refined

Via

Access Paper or Ask Questions

Unsupervised Anomaly Detection using Aggregated Normative Diffusion

Dec 04, 2023

Alexander Frotscher, Jaivardhan Kapoor, Thomas Wolfers, Christian F. Baumgartner

Abstract:Early detection of anomalies in medical images such as brain MRI is highly relevant for diagnosis and treatment of many conditions. Supervised machine learning methods are limited to a small number of pathologies where there is good availability of labeled data. In contrast, unsupervised anomaly detection (UAD) has the potential to identify a broader spectrum of anomalies by spotting deviations from normal patterns. Our research demonstrates that existing state-of-the-art UAD approaches do not generalise well to diverse types of anomalies in realistic multi-modal MR data. To overcome this, we introduce a new UAD method named Aggregated Normative Diffusion (ANDi). ANDi operates by aggregating differences between predicted denoising steps and ground truth backwards transitions in Denoising Diffusion Probabilistic Models (DDPMs) that have been trained on pyramidal Gaussian noise. We validate ANDi against three recent UAD baselines, and across three diverse brain MRI datasets. We show that ANDi, in some cases, substantially surpasses these baselines and shows increased robustness to varying types of anomalies. Particularly in detecting multiple sclerosis (MS) lesions, ANDi achieves improvements of up to 178% in terms of AUPRC.

Via

Access Paper or Ask Questions

Hierarchical Bayesian Regression for Multi-Site Normative Modeling of Neuroimaging Data

May 25, 2020

Seyed Mostafa Kia, Hester Huijsdens, Richard Dinga, Thomas Wolfers, Maarten Mennes, Ole A. Andreassen, Lars T. Westlye, Christian F. Beckmann, Andre F. Marquand

Figure 1 for Hierarchical Bayesian Regression for Multi-Site Normative Modeling of Neuroimaging Data

Figure 2 for Hierarchical Bayesian Regression for Multi-Site Normative Modeling of Neuroimaging Data

Figure 3 for Hierarchical Bayesian Regression for Multi-Site Normative Modeling of Neuroimaging Data

Figure 4 for Hierarchical Bayesian Regression for Multi-Site Normative Modeling of Neuroimaging Data

Abstract:Clinical neuroimaging has recently witnessed explosive growth in data availability which brings studying heterogeneity in clinical cohorts to the spotlight. Normative modeling is an emerging statistical tool for achieving this objective. However, its application remains technically challenging due to difficulties in properly dealing with nuisance variation, for example due to variability in image acquisition devices. Here, in a fully probabilistic framework, we propose an application of hierarchical Bayesian regression (HBR) for multi-site normative modeling. Our experimental results confirm the superiority of HBR in deriving more accurate normative ranges on large multi-site neuroimaging data compared to widely used methods. This provides the possibility i) to learn the normative range of structural and functional brain measures on large multi-site data; ii) to recalibrate and reuse the learned model on local small data; therefore, HBR closes the technical loop for applying normative modeling as a medical tool for the diagnosis and prognosis of mental disorders.

* To be published in MICCAI 2020 proceedings

Via

Access Paper or Ask Questions