Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Stephanie Gil

Characterization and Mitigation of Training Instabilities in Microscaling Formats

Jun 25, 2025

Huangyuan Su, Mujin Kwun, Stephanie Gil, Sham Kakade, Nikhil Anand

Abstract:Training large language models is an expensive, compute-bound process that must be repeated as models scale, algorithms improve, and new data is collected. To address this, next-generation hardware accelerators increasingly support lower-precision arithmetic formats, such as the Microscaling (MX) formats introduced in NVIDIA's Blackwell architecture. These formats use a shared scale within blocks of parameters to extend representable range and perform forward/backward GEMM operations in reduced precision for efficiency gains. In this work, we investigate the challenges and viability of block-scaled precision formats during model training. Across nearly one thousand language models trained from scratch -- spanning compute budgets from $2 \times 10^{17}$ to $4.8 \times 10^{19}$ FLOPs and sweeping over a broad range of weight-activation precision combinations -- we consistently observe that training in MX formats exhibits sharp, stochastic instabilities in the loss, particularly at larger compute scales. To explain this phenomenon, we conduct controlled experiments and ablations on a smaller proxy model that exhibits similar behavior as the language model, sweeping across architectural settings, hyperparameters, and precision formats. These experiments motivate a simple model in which multiplicative gradient bias introduced by the quantization of layer-norm affine parameters and a small fraction of activations can trigger runaway divergence. Through \emph{in situ} intervention experiments on our proxy model, we demonstrate that instabilities can be averted or delayed by modifying precision schemes mid-training. Guided by these findings, we evaluate stabilization strategies in the LLM setting and show that certain hybrid configurations recover performance competitive with full-precision training. We release our code at https://github.com/Hither1/systems-scaling.

* 14 pages + appendices

Via

Access Paper or Ask Questions

Confidence Boosts Trust-Based Resilience in Cooperative Multi-Robot Systems

Jun 10, 2025

Luca Ballotta, Áron Vékássy, Stephanie Gil, Michal Yemini

Abstract:Wireless communication-based multi-robot systems open the door to cyberattacks that can disrupt safety and performance of collaborative robots. The physical channel supporting inter-robot communication offers an attractive opportunity to decouple the detection of malicious robots from task-relevant data exchange between legitimate robots. Yet, trustworthiness indications coming from physical channels are uncertain and must be handled with this in mind. In this paper, we propose a resilient protocol for multi-robot operation wherein a parameter {\lambda}t accounts for how confident a robot is about the legitimacy of nearby robots that the physical channel indicates. Analytical results prove that our protocol achieves resilient coordination with arbitrarily many malicious robots under mild assumptions. Tuning {\lambda}t allows a designer to trade between near-optimal inter-robot coordination and quick task execution; see Fig. 1. This is a fundamental performance tradeoff and must be carefully evaluated based on the task at hand. The effectiveness of our approach is numerically verified with experiments involving platoons of autonomous cars where some vehicles are maliciously spoofed.

* This work has been submitted to IEEE for possible publication

Via

Access Paper or Ask Questions

Interpreting the Linear Structure of Vision-language Model Embedding Spaces

Apr 16, 2025

Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Naomi Saphra, Sham Kakade, Stephanie Gil

Abstract:Vision-language models encode images and text in a joint space, minimizing the distance between corresponding image and text pairs. How are language and images organized in this joint space, and how do the models encode meaning and modality? To investigate this, we train and release sparse autoencoders (SAEs) on the embedding spaces of four vision-language models (CLIP, SigLIP, SigLIP2, and AIMv2). SAEs approximate model embeddings as sparse linear combinations of learned directions, or "concepts". We find that, compared to other methods of linear feature learning, SAEs are better at reconstructing the real embeddings, while also able to retain the most sparsity. Retraining SAEs with different seeds or different data diet leads to two findings: the rare, specific concepts captured by the SAEs are liable to change drastically, but we also show that the key commonly-activating concepts extracted by SAEs are remarkably stable across runs. Interestingly, while most concepts are strongly unimodal in activation, we find they are not merely encoding modality per se. Many lie close to - but not entirely within - the subspace defining modality, suggesting that they encode cross-modal semantics despite their unimodal usage. To quantify this bridging behavior, we introduce the Bridge Score, a metric that identifies concept pairs which are both co-activated across aligned image-text inputs and geometrically aligned in the shared space. This reveals that even unimodal concepts can collaborate to support cross-modal integration. We release interactive demos of the SAEs for all models, allowing researchers to explore the organization of the concept spaces. Overall, our findings uncover a sparse linear structure within VLM embedding spaces that is shaped by modality, yet stitched together through latent bridges-offering new insight into how multimodal meaning is constructed.

Via

Access Paper or Ask Questions

Provably Stable Multi-Agent Routing with Bounded-Delay Adversaries in the Decision Loop

Apr 01, 2025

Roee M. Francos, Daniel Garces, Stephanie Gil

Abstract:In this work, we are interested in studying multi-agent routing settings, where adversarial agents are part of the assignment and decision loop, degrading the performance of the fleet by incurring bounded delays while servicing pickup-and-delivery requests. Specifically, we are interested in characterizing conditions on the fleet size and the proportion of adversarial agents for which a routing policy remains stable, where stability for a routing policy is achieved if the number of outstanding requests is uniformly bounded over time. To obtain this characterization, we first establish a threshold on the proportion of adversarial agents above which previously stable routing policies for fully cooperative fleets are provably unstable. We then derive a sufficient condition on the fleet size to recover stability given a maximum proportion of adversarial agents. We empirically validate our theoretical results on a case study on autonomous taxi routing, where we consider transportation requests from real San Francisco taxicab data.

* 14 pages, 4 figures

Via

Access Paper or Ask Questions

Pro-Routing: Proactive Routing of Autonomous Multi-Capacity Robots for Pickup-and-Delivery Tasks

Mar 31, 2025

Daniel Garces, Stephanie Gil

Abstract:We consider a multi-robot setting, where we have a fleet of multi-capacity autonomous robots that must service spatially distributed pickup-and-delivery requests with fixed maximum wait times. Requests can be either scheduled ahead of time or they can enter the system in real-time. In this setting, stability for a routing policy is defined as the cost of the policy being uniformly bounded over time. Most previous work either solve the problem offline to theoretically maintain stability or they consider dynamically arriving requests at the expense of the theoretical guarantees on stability. In this paper, we aim to bridge this gap by proposing a novel proactive rollout-based routing framework that adapts to real-time demand while still provably maintaining the stability of the learned routing policy. We derive provable stability guarantees for our method by proposing a fleet sizing algorithm that obtains a sufficiently large fleet that ensures stability by construction. To validate our theoretical results, we consider a case study on real ride requests for Harvard's evening Van System. We also evaluate the performance of our framework using the currently deployed smaller fleet size. In this smaller setup, we compare against the currently deployed routing algorithm, greedy heuristics, and Monte-Carlo-Tree-Search-based algorithms. Our empirical results show that our framework maintains stability when we use the sufficiently large fleet size found in our theoretical results. For the smaller currently deployed fleet size, our method services 6% more requests than the closest baseline while reducing median passenger wait times by 33%.

* 25 pages, 7 figures, and 1 table

Via

Access Paper or Ask Questions

Data-Efficient Multi-Agent Spatial Planning with LLMs

Feb 26, 2025

Huangyuan Su, Aaron Walsman, Daniel Garces, Sham Kakade, Stephanie Gil

Abstract:In this project, our goal is to determine how to leverage the world-knowledge of pretrained large language models for efficient and robust learning in multiagent decision making. We examine this in a taxi routing and assignment problem where agents must decide how to best pick up passengers in order to minimize overall waiting time. While this problem is situated on a graphical road network, we show that with the proper prompting zero-shot performance is quite strong on this task. Furthermore, with limited fine-tuning along with the one-at-a-time rollout algorithm for look ahead, LLMs can out-compete existing approaches with 50 times fewer environmental interactions. We also explore the benefits of various linguistic prompting approaches and show that including certain easy-to-compute information in the prompt significantly improves performance. Finally, we highlight the LLM's built-in semantic understanding, showing its ability to adapt to environmental factors through simple prompts.

Via

Access Paper or Ask Questions

WiSER-X: Wireless Signals-based Efficient Decentralized Multi-Robot Exploration without Explicit Information Exchange

Dec 27, 2024

Ninad Jadhav, Meghna Behari, Robert J. Wood, Stephanie Gil

Abstract:We introduce a Wireless Signal based Efficient multi-Robot eXploration (WiSER-X) algorithm applicable to a decentralized team of robots exploring an unknown environment with communication bandwidth constraints. WiSER-X relies only on local inter-robot relative position estimates, that can be obtained by exchanging signal pings from onboard sensors such as WiFi, Ultra-Wide Band, amongst others, to inform the exploration decisions of individual robots to minimize redundant coverage overlaps. Furthermore, WiSER-X also enables asynchronous termination without requiring a shared map between the robots. It also adapts to heterogeneous robot behaviors and even complete failures in unknown environment while ensuring complete coverage. Simulations show that WiSER-X leads to 58% lower overlap than a zero-information-sharing baseline algorithm-1 and only 23% more overlap than a full-information-sharing algorithm baseline algorithm-2.

Via

Access Paper or Ask Questions

WiFi-CSI Sensing and Bearing Estimation in Multi-Robot Systems: An Open-Source Simulation Framework

Oct 02, 2024

Brendan Dijkstra, Ninad Jadhav, Alex Sloot, Matteo Marcantoni, Bayu Jayawardhana, Stephanie Gil, Bahar Haghighat

Figure 1 for WiFi-CSI Sensing and Bearing Estimation in Multi-Robot Systems: An Open-Source Simulation Framework

Figure 2 for WiFi-CSI Sensing and Bearing Estimation in Multi-Robot Systems: An Open-Source Simulation Framework

Figure 3 for WiFi-CSI Sensing and Bearing Estimation in Multi-Robot Systems: An Open-Source Simulation Framework

Figure 4 for WiFi-CSI Sensing and Bearing Estimation in Multi-Robot Systems: An Open-Source Simulation Framework

Abstract:Development and testing of multi-robot systems employing wireless signal-based sensing requires access to suitable hardware, such as channel monitoring WiFi transceivers, which can pose significant limitations. The WiFi Sensor for Robotics (WSR) toolbox, introduced by Jadhav et al. in 2022, provides a novel solution by using WiFi Channel State Information (CSI) to compute relative bearing between robots. The toolbox leverages the amplitude and phase of WiFi signals and creates virtual antenna arrays by exploiting the motion of mobile robots, eliminating the need for physical antenna arrays. However, the WSR toolbox's reliance on an obsoleting WiFi transceiver hardware has limited its operability and accessibility, hindering broader application and development of relevant tools. We present an open-source simulation framework that replicates the WSR toolbox's capabilities using Gazebo and Matlab. By simulating WiFi-CSI data collection, our framework emulates the behavior of mobile robots equipped with the WSR toolbox, enabling precise bearing estimation without physical hardware. We validate the framework through experiments with both simulated and real Turtlebot3 robots, showing a close match between the obtained CSI data and the resulting bearing estimates. This work provides a virtual environment for developing and testing WiFi-CSI-based multi-robot localization without relying on physical hardware. All code and experimental setup information are publicly available at https://github.com/BrendanxP/CSI-Simulation-Framework

* 6+1 pages (text + references), 6 figures

Via

Access Paper or Ask Questions

Fast Distributed Optimization over Directed Graphs under Malicious Attacks using Trust

Jul 09, 2024

Arif Kerem Dayı, Orhan Eren Akgün, Stephanie Gil, Michal Yemini, Angelia Nedić

Figure 1 for Fast Distributed Optimization over Directed Graphs under Malicious Attacks using Trust

Figure 2 for Fast Distributed Optimization over Directed Graphs under Malicious Attacks using Trust

Abstract:In this work, we introduce the Resilient Projected Push-Pull (RP3) algorithm designed for distributed optimization in multi-agent cyber-physical systems with directed communication graphs and the presence of malicious agents. Our algorithm leverages stochastic inter-agent trust values and gradient tracking to achieve geometric convergence rates in expectation even in adversarial environments. We introduce growing constraint sets to limit the impact of the malicious agents without compromising the geometric convergence rate of the algorithm. We prove that RP3 converges to the nominal optimal solution almost surely and in the $r$-th mean for any $r\geq 1$, provided the step sizes are sufficiently small and the constraint sets are appropriately chosen. We validate our approach with numerical studies on average consensus and multi-robot target tracking problems, demonstrating that RP3 effectively mitigates the impact of malicious agents and achieves the desired geometric convergence.

* 31 pages, 2 figures

Via

Access Paper or Ask Questions

MULAN-WC: Multi-Robot Localization Uncertainty-aware Active NeRF with Wireless Coordination

Mar 20, 2024

Weiying Wang, Victor Cai, Stephanie Gil

Abstract:This paper presents MULAN-WC, a novel multi-robot 3D reconstruction framework that leverages wireless signal-based coordination between robots and Neural Radiance Fields (NeRF). Our approach addresses key challenges in multi-robot 3D reconstruction, including inter-robot pose estimation, localization uncertainty quantification, and active best-next-view selection. We introduce a method for using wireless Angle-of-Arrival (AoA) and ranging measurements to estimate relative poses between robots, as well as quantifying and incorporating the uncertainty embedded in the wireless localization of these pose estimates into the NeRF training loss to mitigate the impact of inaccurate camera poses. Furthermore, we propose an active view selection approach that accounts for robot pose uncertainty when determining the next-best views to improve the 3D reconstruction, enabling faster convergence through intelligent view selection. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our framework in theory and in practice. Leveraging wireless coordination and localization uncertainty-aware training, MULAN-WC can achieve high-quality 3d reconstruction which is close to applying the ground truth camera poses. Furthermore, the quantification of the information gain from a novel view enables consistent rendering quality improvement with incrementally captured images by commending the robot the novel view position. Our hardware experiments showcase the practicality of deploying MULAN-WC to real robotic systems.

Via

Access Paper or Ask Questions