Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Richard Hartley

Probability Density Geodesics in Image Diffusion Latent Space

Apr 09, 2025

Qingtao Yu, Jaskirat Singh, Zhaoyuan Yang, Peter Henry Tu, Jing Zhang, Hongdong Li, Richard Hartley, Dylan Campbell

Abstract:Diffusion models indirectly estimate the probability density over a data space, which can be used to study its structure. In this work, we show that geodesics can be computed in diffusion latent space, where the norm induced by the spatially-varying inner product is inversely proportional to the probability density. In this formulation, a path that traverses a high density (that is, probable) region of image latent space is shorter than the equivalent path through a low density region. We present algorithms for solving the associated initial and boundary value problems and show how to compute the probability density along the path and the geodesic distance between two points. Using these techniques, we analyze how closely video clips approximate geodesics in a pre-trained image diffusion space. Finally, we demonstrate how these techniques can be applied to training-free image sequence interpolation and extrapolation, given a pre-trained image diffusion model.

* CVPR2025

Via

Access Paper or Ask Questions

NF-SLAM: Effective, Normalizing Flow-supported Neural Field representations for object-level visual SLAM in automotive applications

Mar 14, 2025

Li Cui, Yang Ding, Richard Hartley, Zirui Xie, Laurent Kneip, Zhenghua Yu

Abstract:We propose a novel, vision-only object-level SLAM framework for automotive applications representing 3D shapes by implicit signed distance functions. Our key innovation consists of augmenting the standard neural representation by a normalizing flow network. As a result, achieving strong representation power on the specific class of road vehicles is made possible by compact networks with only 16-dimensional latent codes. Furthermore, the newly proposed architecture exhibits a significant performance improvement in the presence of only sparse and noisy data, which is demonstrated through comparative experiments on synthetic data. The module is embedded into the back-end of a stereo-vision based framework for joint, incremental shape optimization. The loss function is given by a combination of a sparse 3D point-based SDF loss, a sparse rendering loss, and a semantic mask-based silhouette-consistency term. We furthermore leverage semantic information to determine keypoint extraction density in the front-end. Finally, experimental results on real-world data reveal accurate and reliable performance comparable to alternative frameworks that make use of direct depth readings. The proposed method performs well with only sparse 3D points obtained from bundle adjustment, and eventually continues to deliver stable results even under exclusive use of the mask-consistency term.

* 9 pages, 5 figures, IROS 2024

Via

Access Paper or Ask Questions

Exploring Bias in over 100 Text-to-Image Generative Models

Mar 11, 2025

Jordan Vice, Naveed Akhtar, Richard Hartley, Ajmal Mian

Abstract:We investigate bias trends in text-to-image generative models over time, focusing on the increasing availability of models through open platforms like Hugging Face. While these platforms democratize AI, they also facilitate the spread of inherently biased models, often shaped by task-specific fine-tuning. Ensuring ethical and transparent AI deployment requires robust evaluation frameworks and quantifiable bias metrics. To this end, we assess bias across three key dimensions: (i) distribution bias, (ii) generative hallucination, and (iii) generative miss-rate. Analyzing over 100 models, we reveal how bias patterns evolve over time and across generative tasks. Our findings indicate that artistic and style-transferred models exhibit significant bias, whereas foundation models, benefiting from broader training distributions, are becoming progressively less biased. By identifying these systemic trends, we contribute a large-scale evaluation corpus to inform bias research and mitigation strategies, fostering more responsible AI development. Keywords: Bias, Ethical AI, Text-to-Image, Generative Models, Open-Source Models

* Accepted to ICLR 2025 Workshop on Open Science for Foundation Models (SCI-FM)

Via

Access Paper or Ask Questions

Motion Anything: Any to Motion Generation

Mar 10, 2025

Zeyu Zhang, Yiran Wang, Wei Mao, Danning Li, Rui Zhao, Biao Wu, Zirui Song, Bohan Zhuang, Ian Reid, Richard Hartley

Abstract:Conditional motion generation has been extensively studied in computer vision, yet two critical challenges remain. First, while masked autoregressive methods have recently outperformed diffusion-based approaches, existing masking models lack a mechanism to prioritize dynamic frames and body parts based on given conditions. Second, existing methods for different conditioning modalities often fail to integrate multiple modalities effectively, limiting control and coherence in generated motion. To address these challenges, we propose Motion Anything, a multimodal motion generation framework that introduces an Attention-based Mask Modeling approach, enabling fine-grained spatial and temporal control over key frames and actions. Our model adaptively encodes multimodal conditions, including text and music, improving controllability. Additionally, we introduce Text-Motion-Dance (TMD), a new motion dataset consisting of 2,153 pairs of text, music, and dance, making it twice the size of AIST++, thereby filling a critical gap in the community. Extensive experiments demonstrate that Motion Anything surpasses state-of-the-art methods across multiple benchmarks, achieving a 15% improvement in FID on HumanML3D and showing consistent performance gains on AIST++ and TMD. See our project website https://steve-zeyu-zhang.github.io/MotionAnything

Via

Access Paper or Ask Questions

DoubleDiffusion: Combining Heat Diffusion with Denoising Diffusion for Generative Learning on 3D Meshes

Jan 06, 2025

Xuyang Wang, Ziang Cheng, Zhenyu Li, Jiayu Yang, Haorui Ji, Pan Ji, Mehrtash Harandi, Richard Hartley, Hongdong Li

Figure 1 for DoubleDiffusion: Combining Heat Diffusion with Denoising Diffusion for Generative Learning on 3D Meshes

Figure 2 for DoubleDiffusion: Combining Heat Diffusion with Denoising Diffusion for Generative Learning on 3D Meshes

Figure 3 for DoubleDiffusion: Combining Heat Diffusion with Denoising Diffusion for Generative Learning on 3D Meshes

Figure 4 for DoubleDiffusion: Combining Heat Diffusion with Denoising Diffusion for Generative Learning on 3D Meshes

Abstract:This paper proposes DoubleDiffusion, a novel framework that combines heat dissipation diffusion and denoising diffusion for direct generative learning on 3D mesh surfaces. Our approach addresses the challenges of generating continuous signal distributions residing on a curve manifold surface. Unlike previous methods that rely on unrolling 3D meshes into 2D or adopting field representations, DoubleDiffusion leverages the Laplacian-Beltrami operator to process features respecting the mesh structure. This combination enables effective geometry-aware signal diffusion across the underlying geometry. As shown in Fig.~\ref{fig:teaser}, we demonstrate that DoubleDiffusion has the ability to generate RGB signal distributions on complex 3D mesh surfaces and achieves per-category shape-conditioned texture generation across different shape geometry. Our work contributes a new direction in diffusion-based generative modeling on 3D surfaces, with potential applications in the field of 3D asset generation.

Via

Access Paper or Ask Questions

On the Fairness, Diversity and Reliability of Text-to-Image Generative Models

Nov 21, 2024

Jordan Vice, Naveed Akhtar, Richard Hartley, Ajmal Mian

Figure 1 for On the Fairness, Diversity and Reliability of Text-to-Image Generative Models

Figure 2 for On the Fairness, Diversity and Reliability of Text-to-Image Generative Models

Figure 3 for On the Fairness, Diversity and Reliability of Text-to-Image Generative Models

Figure 4 for On the Fairness, Diversity and Reliability of Text-to-Image Generative Models

Abstract:The widespread availability of multimodal generative models has sparked critical discussions on their fairness, reliability, and potential for misuse. While text-to-image models can produce high-fidelity, user-guided images, they also exhibit unpredictable behavior and vulnerabilities, which can be exploited to manipulate class or concept representations. To address this, we propose an evaluation framework designed to assess model reliability through their responses to globally- and locally-applied `semantic' perturbations in the embedding space, pinpointing inputs that trigger unreliable behavior. Our approach offers deeper insights into two essential aspects: (i) generative diversity, evaluating the breadth of visual representations for learned concepts, and (ii) generative fairness, examining how removing concepts from input prompts affects semantic guidance. Beyond these evaluations, our method lays the groundwork for detecting unreliable, bias-injected models and retrieval of bias provenance. We will release our code. Keywords: Fairness, Reliability, AI Ethics, Bias, Text-to-Image Models

* This research is supported by the NISDRG project #20100007, funded by the Australian Government

Via

Access Paper or Ask Questions

Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction

Nov 21, 2024

Jordan Vice, Naveed Akhtar, Richard Hartley, Ajmal Mian

Figure 1 for Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction

Figure 2 for Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction

Figure 3 for Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction

Figure 4 for Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction

Abstract:Training multimodal generative models on large, uncurated datasets can result in users being exposed to harmful, unsafe and controversial or culturally-inappropriate outputs. While model editing has been proposed to remove or filter undesirable concepts in embedding and latent spaces, it can inadvertently damage learned manifolds, distorting concepts in close semantic proximity. We identify limitations in current model editing techniques, showing that even benign, proximal concepts may become misaligned. To address the need for safe content generation, we propose a modular, dynamic solution that leverages safety-context embeddings and a dual reconstruction process using tunable weighted summation in the latent space to generate safer images. Our method preserves global context without compromising the structural integrity of the learned manifolds. We achieve state-of-the-art results on safe image generation benchmarks, while offering controllable variation of model safety. We identify trade-offs between safety and censorship, which presents a necessary perspective in the development of ethical AI models. We will release our code. Keywords: Text-to-Image Models, Generative AI, Safety, Reliability, Model Editing

* This research is supported by the NISDRG project #20100007, funded by the Australian Government

Via

Access Paper or Ask Questions

Learning-based Multi-View Stereo: A Survey

Aug 27, 2024

Fangjinhua Wang, Qingtian Zhu, Di Chang, Quankai Gao, Junlin Han, Tong Zhang, Richard Hartley, Marc Pollefeys

Figure 1 for Learning-based Multi-View Stereo: A Survey

Figure 2 for Learning-based Multi-View Stereo: A Survey

Figure 3 for Learning-based Multi-View Stereo: A Survey

Figure 4 for Learning-based Multi-View Stereo: A Survey

Abstract:3D reconstruction aims to recover the dense 3D structure of a scene. It plays an essential role in various applications such as Augmented/Virtual Reality (AR/VR), autonomous driving and robotics. Leveraging multiple views of a scene captured from different viewpoints, Multi-View Stereo (MVS) algorithms synthesize a comprehensive 3D representation, enabling precise reconstruction in complex environments. Due to its efficiency and effectiveness, MVS has become a pivotal method for image-based 3D reconstruction. Recently, with the success of deep learning, many learning-based MVS methods have been proposed, achieving impressive performance against traditional methods. We categorize these learning-based methods as: depth map-based, voxel-based, NeRF-based, 3D Gaussian Splatting-based, and large feed-forward methods. Among these, we focus significantly on depth map-based methods, which are the main family of MVS due to their conciseness, flexibility and scalability. In this survey, we provide a comprehensive review of the literature at the time of this writing. We investigate these learning-based methods, summarize their performances on popular benchmarks, and discuss promising future research directions in this area.

Via

Access Paper or Ask Questions

InfiniMotion: Mamba Boosts Memory in Transformer for Arbitrary Long Motion Generation

Jul 14, 2024

Zeyu Zhang, Akide Liu, Qi Chen, Feng Chen, Ian Reid, Richard Hartley, Bohan Zhuang, Hao Tang

Abstract:Text-to-motion generation holds potential for film, gaming, and robotics, yet current methods often prioritize short motion generation, making it challenging to produce long motion sequences effectively: (1) Current methods struggle to handle long motion sequences as a single input due to prohibitively high computational cost; (2) Breaking down the generation of long motion sequences into shorter segments can result in inconsistent transitions and requires interpolation or inpainting, which lacks entire sequence modeling. To solve these challenges, we propose InfiniMotion, a method that generates continuous motion sequences of arbitrary length within an autoregressive framework. We highlight its groundbreaking capability by generating a continuous 1-hour human motion with around 80,000 frames. Specifically, we introduce the Motion Memory Transformer with Bidirectional Mamba Memory, enhancing the transformer's memory to process long motion sequences effectively without overwhelming computational resources. Notably our method achieves over 30% improvement in FID and 6 times longer demonstration compared to previous state-of-the-art methods, showcasing significant advancements in long motion generation. See project webpage: https://steve-zeyu-zhang.github.io/InfiniMotion/

Via

Access Paper or Ask Questions

Unlocking the Secrets of Linear Complexity Sequence Model from A Unified Perspective

May 27, 2024

Zhen Qin, Xuyang Shen, Dong Li, Weigao Sun, Stan Birchfield, Richard Hartley, Yiran Zhong

Abstract:We present the Linear Complexity Sequence Model (LCSM), a comprehensive solution that unites various sequence modeling techniques with linear complexity, including linear attention, state space model, long convolution, and linear RNN, within a single framework. The goal is to enhance comprehension of these models by analyzing the impact of each component from a cohesive and streamlined viewpoint. Specifically, we segment the modeling processes of these models into three distinct stages: Expand, Oscillation, and Shrink (EOS), with each model having its own specific settings. The Expand stage involves projecting the input signal onto a high-dimensional memory state. This is followed by recursive operations performed on the memory state in the Oscillation stage. Finally, the memory state is projected back to a low-dimensional space in the Shrink stage. We perform comprehensive experiments to analyze the impact of different stage settings on language modeling and retrieval tasks. Our results show that data-driven methods are crucial for the effectiveness of the three stages in language modeling, whereas hand-crafted methods yield better performance in retrieval tasks.

* Technical report. Yiran Zhong is the corresponding author

Via

Access Paper or Ask Questions