Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Sean Sedwards

HAWAII: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models

Jun 23, 2025

Yimu Wang, Mozhgan Nasr Azadani, Sean Sedwards, Krzysztof Czarnecki

Abstract:Improving the visual understanding ability of vision-language models (VLMs) is crucial for enhancing their performance across various tasks. While using multiple pretrained visual experts has shown great promise, it often incurs significant computational costs during training and inference. To address this challenge, we propose HAWAII, a novel framework that distills knowledge from multiple visual experts into a single vision encoder, enabling it to inherit the complementary strengths of several experts with minimal computational overhead. To mitigate conflicts among different teachers and switch between different teacher-specific knowledge, instead of using a fixed set of adapters for multiple teachers, we propose to use teacher-specific Low-Rank Adaptation (LoRA) adapters with a corresponding router. Each adapter is aligned with a specific teacher, avoiding noisy guidance during distillation. To enable efficient knowledge distillation, we propose fine-grained and coarse-grained distillation. At the fine-grained level, token importance scores are employed to emphasize the most informative tokens from each teacher adaptively. At the coarse-grained level, we summarize the knowledge from multiple teachers and transfer it to the student using a set of general-knowledge LoRA adapters with a router. Extensive experiments on various vision-language tasks demonstrate the superiority of HAWAII, compared to the popular open-source VLMs.

* Work in progress

Via

Access Paper or Ask Questions

LEO-MINI: An Efficient Multimodal Large Language Model using Conditional Token Reduction and Mixture of Multi-Modal Experts

Apr 07, 2025

Yimu Wang, Mozhgan Nasr Azadani, Sean Sedwards, Krzysztof Czarnecki

Abstract:Redundancy of visual tokens in multi-modal large language models (MLLMs) significantly reduces their computational efficiency. Recent approaches, such as resamplers and summarizers, have sought to reduce the number of visual tokens, but at the cost of visual reasoning ability. To address this, we propose LEO-MINI, a novel MLLM that significantly reduces the number of visual tokens and simultaneously boosts visual reasoning capabilities. For efficiency, LEO-MINI incorporates CoTR, a novel token reduction module to consolidate a large number of visual tokens into a smaller set of tokens, using the similarity between visual tokens, text tokens, and a compact learnable query. For effectiveness, to scale up the model's ability with minimal computational overhead, LEO-MINI employs MMoE, a novel mixture of multi-modal experts module. MMOE employs a set of LoRA experts with a novel router to switch between them based on the input text and visual tokens instead of only using the input hidden state. MMoE also includes a general LoRA expert that is always activated to learn general knowledge for LLM reasoning. For extracting richer visual features, MMOE employs a set of vision experts trained on diverse domain-specific data. To demonstrate LEO-MINI's improved efficiency and performance, we evaluate it against existing efficient MLLMs on various benchmark vision-language tasks.

Via

Access Paper or Ask Questions

OV-SCAN: Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection

Mar 09, 2025

Adrian Chow, Evelien Riddell, Yimu Wang, Sean Sedwards, Krzysztof Czarnecki

Abstract:Open-vocabulary 3D object detection for autonomous driving aims to detect novel objects beyond the predefined training label sets in point cloud scenes. Existing approaches achieve this by connecting traditional 3D object detectors with vision-language models (VLMs) to regress 3D bounding boxes for novel objects and perform open-vocabulary classification through cross-modal alignment between 3D and 2D features. However, achieving robust cross-modal alignment remains a challenge due to semantic inconsistencies when generating corresponding 3D and 2D feature pairs. To overcome this challenge, we present OV-SCAN, an Open-Vocabulary 3D framework that enforces Semantically Consistent Alignment for Novel object discovery. OV-SCAN employs two core strategies: discovering precise 3D annotations and filtering out low-quality or corrupted alignment pairs (arising from 3D annotation, occlusion-induced, or resolution-induced noise). Extensive experiments on the nuScenes dataset demonstrate that OV-SCAN achieves state-of-the-art performance.

Via

Access Paper or Ask Questions

LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models

Jan 13, 2025

Mozhgan Nasr Azadani, James Riddell, Sean Sedwards, Krzysztof Czarnecki

Abstract:Enhanced visual understanding serves as a cornerstone for multimodal large language models (MLLMs). Recent hybrid MLLMs incorporate a mixture of vision experts to address the limitations of using a single vision encoder and excessively long visual tokens. Despite the progress of these MLLMs, a research gap remains in effectively integrating diverse vision encoders. This work explores fusion strategies of visual tokens for hybrid MLLMs, leading to the design of LEO, a novel MLLM with a dual-branch vision encoder framework that incorporates a post-adaptation fusion strategy and adaptive tiling: for each segmented tile of the input images, LEO sequentially interleaves the visual tokens from its two vision encoders. Extensive evaluation across 13 vision-language benchmarks reveals that LEO outperforms state-of-the-art open-source MLLMs and hybrid MLLMs on the majority of tasks. Furthermore, we show that LEO can be adapted to the specialized domain of autonomous driving without altering the model architecture or training recipe, achieving competitive performance compared to existing baselines. The code and model will be publicly available.

Via

Access Paper or Ask Questions

VADet: Multi-frame LiDAR 3D Object Detection using Variable Aggregation

Nov 20, 2024

Chengjie Huang, Vahdat Abdelzad, Sean Sedwards, Krzysztof Czarnecki

Figure 1 for VADet: Multi-frame LiDAR 3D Object Detection using Variable Aggregation

Figure 2 for VADet: Multi-frame LiDAR 3D Object Detection using Variable Aggregation

Figure 3 for VADet: Multi-frame LiDAR 3D Object Detection using Variable Aggregation

Figure 4 for VADet: Multi-frame LiDAR 3D Object Detection using Variable Aggregation

Abstract:Input aggregation is a simple technique used by state-of-the-art LiDAR 3D object detectors to improve detection. However, increasing aggregation is known to have diminishing returns and even performance degradation, due to objects responding differently to the number of aggregated frames. To address this limitation, we propose an efficient adaptive method, which we call Variable Aggregation Detection (VADet). Instead of aggregating the entire scene using a fixed number of frames, VADet performs aggregation per object, with the number of frames determined by an object's observed properties, such as speed and point density. VADet thus reduces the inherent trade-offs of fixed aggregation and is not architecture specific. To demonstrate its benefits, we apply VADet to three popular single-stage detectors and achieve state-of-the-art performance on the Waymo dataset.

* Accepted by WACV 2025

Via

Access Paper or Ask Questions

SOAP: Cross-sensor Domain Adaptation for 3D Object Detection Using Stationary Object Aggregation Pseudo-labelling

Jan 08, 2024

Chengjie Huang, Vahdat Abdelzad, Sean Sedwards, Krzysztof Czarnecki

Figure 1 for SOAP: Cross-sensor Domain Adaptation for 3D Object Detection Using Stationary Object Aggregation Pseudo-labelling

Figure 2 for SOAP: Cross-sensor Domain Adaptation for 3D Object Detection Using Stationary Object Aggregation Pseudo-labelling

Figure 3 for SOAP: Cross-sensor Domain Adaptation for 3D Object Detection Using Stationary Object Aggregation Pseudo-labelling

Figure 4 for SOAP: Cross-sensor Domain Adaptation for 3D Object Detection Using Stationary Object Aggregation Pseudo-labelling

Abstract:We consider the problem of cross-sensor domain adaptation in the context of LiDAR-based 3D object detection and propose Stationary Object Aggregation Pseudo-labelling (SOAP) to generate high quality pseudo-labels for stationary objects. In contrast to the current state-of-the-art in-domain practice of aggregating just a few input scans, SOAP aggregates entire sequences of point clouds at the input level to reduce the sensor domain gap. Then, by means of what we call quasi-stationary training and spatial consistency post-processing, the SOAP model generates accurate pseudo-labels for stationary objects, closing a minimum of 30.3% domain gap compared to few-frame detectors. Our results also show that state-of-the-art domain adaptation approaches can achieve even greater performance in combination with SOAP, in both the unsupervised and semi-supervised settings.

* Accepted by WACV 2024

Via

Access Paper or Ask Questions

A Hierarchical Pedestrian Behavior Model to Generate Realistic Human Behavior in Traffic Simulation

Jun 01, 2022

Scott Larter, Rodrigo Queiroz, Sean Sedwards, Atrisha Sarkar, Krzysztof Czarnecki

Figure 1 for A Hierarchical Pedestrian Behavior Model to Generate Realistic Human Behavior in Traffic Simulation

Figure 2 for A Hierarchical Pedestrian Behavior Model to Generate Realistic Human Behavior in Traffic Simulation

Figure 3 for A Hierarchical Pedestrian Behavior Model to Generate Realistic Human Behavior in Traffic Simulation

Figure 4 for A Hierarchical Pedestrian Behavior Model to Generate Realistic Human Behavior in Traffic Simulation

Abstract:Modelling pedestrian behavior is crucial in the development and testing of autonomous vehicles. In this work, we present a hierarchical pedestrian behavior model that generates high-level decisions through the use of behavior trees, in order to produce maneuvers executed by a low-level motion planner using an adapted Social Force model. A full implementation of our work is integrated into GeoScenario Server, a scenario definition and execution engine, extending its vehicle simulation capabilities with pedestrian simulation. The extended environment allows simulating test scenarios involving both vehicles and pedestrians to assist in the scenario-based testing process of autonomous vehicles. The presented hierarchical model is evaluated on two real-world data sets collected at separate locations with different road structures. Our model is shown to replicate the real-world pedestrians' trajectories with a high degree of fidelity and a decision-making accuracy of 98% or better, given only high-level routing information for each pedestrian.

* 9 pages, 4 figures, 3 tables. Accepted to the 2022 IEEE Intelligent Vehicles Symposium

Via

Access Paper or Ask Questions

Recursive Constraints to Prevent Instability in Constrained Reinforcement Learning

Jan 20, 2022

Jaeyoung Lee, Sean Sedwards, Krzysztof Czarnecki

Abstract:We consider the challenge of finding a deterministic policy for a Markov decision process that uniformly (in all states) maximizes one reward subject to a probabilistic constraint over a different reward. Existing solutions do not fully address our precise problem definition, which nevertheless arises naturally in the context of safety-critical robotic systems. This class of problem is known to be hard, but the combined requirements of determinism and uniform optimality can create learning instability. In this work, after describing and motivating our problem with a simple example, we present a suitable constrained reinforcement learning algorithm that prevents learning instability, using recursive constraints. Our proposed approach admits an approximative form that improves efficiency and is conservative w.r.t. the constraint.

* Accepted at 1st Multi-Objective Decision Making Workshop (MODeM 2021). Cite as: Jaeyoung Lee, Sean Sedwards and Krzysztof Czarnecki. (2021). Recursive constraints to prevent instability in constrained reinforcement learning. In: Proc. 1st Multi-Objective Decision Making Workshop (MODeM 2021), Hayes, Mannion, Vamplew (eds). Online at http://modem2021.cs.nuigalway.ie

Via

Access Paper or Ask Questions

Design Space of Behaviour Planning for Autonomous Driving

Aug 21, 2019

Marko Ilievski, Sean Sedwards, Ashish Gaurav, Aravind Balakrishnan, Atrisha Sarkar, Jaeyoung Lee, Frédéric Bouchard, Ryan De Iaco, Krzysztof Czarnecki

Figure 1 for Design Space of Behaviour Planning for Autonomous Driving

Figure 2 for Design Space of Behaviour Planning for Autonomous Driving

Figure 3 for Design Space of Behaviour Planning for Autonomous Driving

Figure 4 for Design Space of Behaviour Planning for Autonomous Driving

Abstract:We explore the complex design space of behaviour planning for autonomous driving. Design choices that successfully address one aspect of behaviour planning can critically constrain others. To aid the design process, in this work we decompose the design space with respect to important choices arising from the current state of the art approaches, and describe the resulting trade-offs. In doing this, we also identify interesting directions of future work.

* 8 pages, 6 figures

Via

Access Paper or Ask Questions

WiseMove: A Framework for Safe Deep Reinforcement Learning for Autonomous Driving

Feb 11, 2019

Jaeyoung Lee, Aravind Balakrishnan, Ashish Gaurav, Krzysztof Czarnecki, Sean Sedwards

Figure 1 for WiseMove: A Framework for Safe Deep Reinforcement Learning for Autonomous Driving

Figure 2 for WiseMove: A Framework for Safe Deep Reinforcement Learning for Autonomous Driving

Figure 3 for WiseMove: A Framework for Safe Deep Reinforcement Learning for Autonomous Driving

Figure 4 for WiseMove: A Framework for Safe Deep Reinforcement Learning for Autonomous Driving

Abstract:Machine learning can provide efficient solutions to the complex problems encountered in autonomous driving, but ensuring their safety remains a challenge. A number of authors have attempted to address this issue, but there are few publicly-available tools to adequately explore the trade-offs between functionality, scalability, and safety. We thus present WiseMove, a software framework to investigate safe deep reinforcement learning in the context of motion planning for autonomous driving. WiseMove adopts a modular learning architecture that suits our current research questions and can be adapted to new technologies and new questions. We present the details of WiseMove, demonstrate its use on a common traffic scenario, and describe how we use it in our ongoing safe learning research.

Via

Access Paper or Ask Questions