Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Julia E. Vogt

From Slices to Structures: Unsupervised 3D Reconstruction of Female Pelvic Anatomy from Freehand Transvaginal Ultrasound

Aug 20, 2025

Max Krähenmann, Sergio Tascon-Morales, Fabian Laumer, Julia E. Vogt, Ece Ozkan

Abstract:Volumetric ultrasound has the potential to significantly improve diagnostic accuracy and clinical decision-making, yet its widespread adoption remains limited by dependence on specialized hardware and restrictive acquisition protocols. In this work, we present a novel unsupervised framework for reconstructing 3D anatomical structures from freehand 2D transvaginal ultrasound (TVS) sweeps, without requiring external tracking or learned pose estimators. Our method adapts the principles of Gaussian Splatting to the domain of ultrasound, introducing a slice-aware, differentiable rasterizer tailored to the unique physics and geometry of ultrasound imaging. We model anatomy as a collection of anisotropic 3D Gaussians and optimize their parameters directly from image-level supervision, leveraging sensorless probe motion estimation and domain-specific geometric priors. The result is a compact, flexible, and memory-efficient volumetric representation that captures anatomical detail with high spatial fidelity. This work demonstrates that accurate 3D reconstruction from 2D ultrasound images can be achieved through purely computational means, offering a scalable alternative to conventional 3D systems and enabling new opportunities for AI-assisted analysis and diagnosis.

Via

Access Paper or Ask Questions

From Pixels to Perception: Interpretable Predictions via Instance-wise Grouped Feature Selection

May 09, 2025

Moritz Vandenhirtz, Julia E. Vogt

Abstract:Understanding the decision-making process of machine learning models provides valuable insights into the task, the data, and the reasons behind a model's failures. In this work, we propose a method that performs inherently interpretable predictions through the instance-wise sparsification of input images. To align the sparsification with human perception, we learn the masking in the space of semantically meaningful pixel regions rather than on pixel-level. Additionally, we introduce an explicit way to dynamically determine the required level of sparsity for each instance. We show empirically on semi-synthetic and natural image datasets that our inherently interpretable classifier produces more meaningful, human-understandable predictions than state-of-the-art benchmarks.

* International Conference on Machine Learning

Via

Access Paper or Ask Questions

Revisiting Automatic Data Curation for Vision Foundation Models in Digital Pathology

Mar 24, 2025

Boqi Chen, Cédric Vincent-Cuaz, Lydia A. Schoenpflug, Manuel Madeira, Lisa Fournier, Vaishnavi Subramanian, Sonali Andani, Samuel Ruiperez-Campillo, Julia E. Vogt, Raphaëlle Luisier(+5 more)

Abstract:Vision foundation models (FMs) are accelerating the development of digital pathology algorithms and transforming biomedical research. These models learn, in a self-supervised manner, to represent histological features in highly heterogeneous tiles extracted from whole-slide images (WSIs) of real-world patient samples. The performance of these FMs is significantly influenced by the size, diversity, and balance of the pre-training data. However, data selection has been primarily guided by expert knowledge at the WSI level, focusing on factors such as disease classification and tissue types, while largely overlooking the granular details available at the tile level. In this paper, we investigate the potential of unsupervised automatic data curation at the tile-level, taking into account 350 million tiles. Specifically, we apply hierarchical clustering trees to pre-extracted tile embeddings, allowing us to sample balanced datasets uniformly across the embedding space of the pretrained FM. We further identify these datasets are subject to a trade-off between size and balance, potentially compromising the quality of representations learned by FMs, and propose tailored batch sampling strategies to mitigate this effect. We demonstrate the effectiveness of our method through improved performance on a diverse range of clinically relevant downstream tasks.

* MICCAI 2025

Via

Access Paper or Ask Questions

From Pixels to Components: Eigenvector Masking for Visual Representation Learning

Feb 10, 2025

Alice Bizeul, Thomas Sutter, Alain Ryser, Bernhard Schölkopf, Julius von Kügelgen, Julia E. Vogt

Figure 1 for From Pixels to Components: Eigenvector Masking for Visual Representation Learning

Figure 2 for From Pixels to Components: Eigenvector Masking for Visual Representation Learning

Figure 3 for From Pixels to Components: Eigenvector Masking for Visual Representation Learning

Figure 4 for From Pixels to Components: Eigenvector Masking for Visual Representation Learning

Abstract:Predicting masked from visible parts of an image is a powerful self-supervised approach for visual representation learning. However, the common practice of masking random patches of pixels exhibits certain failure modes, which can prevent learning meaningful high-level features, as required for downstream tasks. We propose an alternative masking strategy that operates on a suitable transformation of the data rather than on the raw pixels. Specifically, we perform principal component analysis and then randomly mask a subset of components, which accounts for a fixed ratio of the data variance. The learning task then amounts to reconstructing the masked components from the visible ones. Compared to local patches of pixels, the principal components of images carry more global information. We thus posit that predicting masked from visible components involves more high-level features, allowing our masking strategy to extract more useful representations. This is corroborated by our empirical findings which demonstrate improved image classification performance for component over pixel masking. Our method thus constitutes a simple and robust data-driven alternative to traditional masked image modeling approaches.

Via

Access Paper or Ask Questions

RadVLM: A Multitask Conversational Vision-Language Model for Radiology

Feb 05, 2025

Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas M. Sutter, Julia E. Vogt(+5 more)

Figure 1 for RadVLM: A Multitask Conversational Vision-Language Model for Radiology

Figure 2 for RadVLM: A Multitask Conversational Vision-Language Model for Radiology

Figure 3 for RadVLM: A Multitask Conversational Vision-Language Model for Radiology

Figure 4 for RadVLM: A Multitask Conversational Vision-Language Model for Radiology

Abstract:The widespread use of chest X-rays (CXRs), coupled with a shortage of radiologists, has driven growing interest in automated CXR analysis and AI-assisted reporting. While existing vision-language models (VLMs) show promise in specific tasks such as report generation or abnormality detection, they often lack support for interactive diagnostic capabilities. In this work we present RadVLM, a compact, multitask conversational foundation model designed for CXR interpretation. To this end, we curate a large-scale instruction dataset comprising over 1 million image-instruction pairs containing both single-turn tasks -- such as report generation, abnormality classification, and visual grounding -- and multi-turn, multi-task conversational interactions. After fine-tuning RadVLM on this instruction dataset, we evaluate it across different tasks along with re-implemented baseline VLMs. Our results show that RadVLM achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive in other radiology tasks. Ablation studies further highlight the benefit of joint training across multiple tasks, particularly for scenarios with limited annotated data. Together, these findings highlight the potential of RadVLM as a clinically relevant AI assistant, providing structured CXR interpretation and conversational capabilities to support more effective and accessible diagnostic workflows.

* 21 pages, 15 figures

Via

Access Paper or Ask Questions

Weakly-Supervised Multimodal Learning on MIMIC-CXR

Nov 15, 2024

Andrea Agostini, Daphné Chopard, Yang Meng, Norbert Fortin, Babak Shahbaba, Stephan Mandt, Thomas M. Sutter, Julia E. Vogt

Figure 1 for Weakly-Supervised Multimodal Learning on MIMIC-CXR

Figure 2 for Weakly-Supervised Multimodal Learning on MIMIC-CXR

Figure 3 for Weakly-Supervised Multimodal Learning on MIMIC-CXR

Figure 4 for Weakly-Supervised Multimodal Learning on MIMIC-CXR

Abstract:Multimodal data integration and label scarcity pose significant challenges for machine learning in medical settings. To address these issues, we conduct an in-depth evaluation of the newly proposed Multimodal Variational Mixture-of-Experts (MMVM) VAE on the challenging MIMIC-CXR dataset. Our analysis demonstrates that the MMVM VAE consistently outperforms other multimodal VAEs and fully supervised approaches, highlighting its strong potential for real-world medical applications.

* Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 13 pages. arXiv admin note: text overlap with arXiv:2403.05300

Via

Access Paper or Ask Questions

Cross-Entropy Is All You Need To Invert the Data Generating Process

Oct 29, 2024

Patrik Reizinger, Alice Bizeul, Attila Juhos, Julia E. Vogt, Randall Balestriero, Wieland Brendel, David Klindt

Figure 1 for Cross-Entropy Is All You Need To Invert the Data Generating Process

Figure 2 for Cross-Entropy Is All You Need To Invert the Data Generating Process

Figure 3 for Cross-Entropy Is All You Need To Invert the Data Generating Process

Figure 4 for Cross-Entropy Is All You Need To Invert the Data Generating Process

Abstract:Supervised learning has become a cornerstone of modern machine learning, yet a comprehensive theory explaining its effectiveness remains elusive. Empirical phenomena, such as neural analogy-making and the linear representation hypothesis, suggest that supervised models can learn interpretable factors of variation in a linear fashion. Recent advances in self-supervised learning, particularly nonlinear Independent Component Analysis, have shown that these methods can recover latent structures by inverting the data generating process. We extend these identifiability results to parametric instance discrimination, then show how insights transfer to the ubiquitous setting of supervised learning with cross-entropy minimization. We prove that even in standard classification tasks, models learn representations of ground-truth factors of variation up to a linear transformation. We corroborate our theoretical contribution with a series of empirical studies. First, using simulated data matching our theoretical assumptions, we demonstrate successful disentanglement of latent factors. Second, we show that on DisLib, a widely-used disentanglement benchmark, simple classification tasks recover latent structures up to linear transformations. Finally, we reveal that models trained on ImageNet encode representations that permit linear decoding of proxy factors of variation. Together, our theoretical findings and experiments offer a compelling explanation for recent observations of linear representations, such as superposition in neural networks. This work takes a significant step toward a cohesive theory that accounts for the unreasonable effectiveness of supervised deep learning.

Via

Access Paper or Ask Questions

Exploiting Interpretable Capabilities with Concept-Enhanced Diffusion and Prototype Networks

Oct 24, 2024

Alba Carballo-Castro, Sonia Laguna, Moritz Vandenhirtz, Julia E. Vogt

Abstract:Concept-based machine learning methods have increasingly gained importance due to the growing interest in making neural networks interpretable. However, concept annotations are generally challenging to obtain, making it crucial to leverage all their prior knowledge. By creating concept-enriched models that incorporate concept information into existing architectures, we exploit their interpretable capabilities to the fullest extent. In particular, we propose Concept-Guided Conditional Diffusion, which can generate visual representations of concepts, and Concept-Guided Prototype Networks, which can create a concept prototype dataset and leverage it to perform interpretable concept prediction. These results open up new lines of research by exploiting pre-existing information in the quest for rendering machine learning more human-understandable.

* Interpretable AI: Past, Present and Future Workshop at NeurIPS 2024

Via

Access Paper or Ask Questions

Hierarchical Clustering for Conditional Diffusion in Image Generation

Oct 22, 2024

Jorge da Silva Goncalves, Laura Manduchi, Moritz Vandenhirtz, Julia E. Vogt

Figure 1 for Hierarchical Clustering for Conditional Diffusion in Image Generation

Figure 2 for Hierarchical Clustering for Conditional Diffusion in Image Generation

Figure 3 for Hierarchical Clustering for Conditional Diffusion in Image Generation

Figure 4 for Hierarchical Clustering for Conditional Diffusion in Image Generation

Abstract:Finding clusters of data points with similar characteristics and generating new cluster-specific samples can significantly enhance our understanding of complex data distributions. While clustering has been widely explored using Variational Autoencoders, these models often lack generation quality in real-world datasets. This paper addresses this gap by introducing TreeDiffusion, a deep generative model that conditions Diffusion Models on hierarchical clusters to obtain high-quality, cluster-specific generations. The proposed pipeline consists of two steps: a VAE-based clustering model that learns the hierarchical structure of the data, and a conditional diffusion model that generates realistic images for each cluster. We propose this two-stage process to ensure that the generated samples remain representative of their respective clusters and enhance image fidelity to the level of diffusion models. A key strength of our method is its ability to create images for each cluster, providing better visualization of the learned representations by the clustering model, as demonstrated through qualitative results. This method effectively addresses the generative limitations of VAE-based approaches while preserving their clustering performance. Empirically, we demonstrate that conditioning diffusion models on hierarchical clusters significantly enhances generative performance, thereby advancing the state of generative clustering models.

* 25 pages, submitted to ICLR 2025

Via

Access Paper or Ask Questions

From Logits to Hierarchies: Hierarchical Clustering made Simple

Oct 10, 2024

Emanuele Palumbo, Moritz Vandenhirtz, Alain Ryser, Imant Daunhawer, Julia E. Vogt

Figure 1 for From Logits to Hierarchies: Hierarchical Clustering made Simple

Figure 2 for From Logits to Hierarchies: Hierarchical Clustering made Simple

Figure 3 for From Logits to Hierarchies: Hierarchical Clustering made Simple

Figure 4 for From Logits to Hierarchies: Hierarchical Clustering made Simple

Abstract:The structure of many real-world datasets is intrinsically hierarchical, making the modeling of such hierarchies a critical objective in both unsupervised and supervised machine learning. Recently, novel approaches for hierarchical clustering with deep architectures have been proposed. In this work, we take a critical perspective on this line of research and demonstrate that many approaches exhibit major limitations when applied to realistic datasets, partly due to their high computational complexity. In particular, we show that a lightweight procedure implemented on top of pre-trained non-hierarchical clustering models outperforms models designed specifically for hierarchical clustering. Our proposed approach is computationally efficient and applicable to any pre-trained clustering model that outputs logits, without requiring any fine-tuning. To highlight the generality of our findings, we illustrate how our method can also be applied in a supervised setup, recovering meaningful hierarchies from a pre-trained ImageNet classifier.

Via

Access Paper or Ask Questions