Connor Schenck

Learning the RoPEs: Better 2D and 3D Position Encodings with STRING

Feb 04, 2025

Connor Schenck, Isaac Reid, Mithun George Jacob, Alex Bewley, Joshua Ainslie, David Rendleman, Deepali Jain, Mohit Sharma, Avinava Dubey, Ayzaan Wahid(+12 more)

Abstract:We introduce STRING: Separable Translationally Invariant Position Encodings. STRING extends Rotary Position Encodings, a recently proposed and widely used algorithm in large language models, via a unifying theoretical framework. Importantly, STRING still provides exact translation invariance, including for token coordinates of arbitrary dimensionality, whilst maintaining a low computational footprint. These properties are especially important in robotics, where efficient 3D token representation is key. We integrate STRING into Vision Transformers with RGB(-D) inputs (color plus optional depth), showing substantial gains, e.g. in open-vocabulary object detection and for robotics controllers. We complement our experiments with a rigorous mathematical analysis, proving the universality of our methods.

* Videos of STRING-based robotics controllers can be found here: https://sites.google.com/view/string-robotics
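
For context, the translation invariance mentioned in the abstract can already be seen in ordinary one-dimensional RoPE. The sketch below is not STRING itself; it is plain RoPE in pure Python, and the function names `rope` and `score` as well as the frequency base of 10000 are assumptions made for this illustration. It checks that the attention score between a query and a key depends only on the offset between their positions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # one frequency per coordinate pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def score(q, k, pos_q, pos_k):
    """Dot product of the position-encoded query and key."""
    rq, rk = rope(q, pos_q), rope(k, pos_k)
    return sum(a * b for a, b in zip(rq, rk))

# Hypothetical query and key vectors (even dimension is required).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

s_near = score(q, k, 3, 7)        # offset of 4
s_shifted = score(q, k, 103, 107) # same offset, both positions shifted by 100
print(abs(s_near - s_shifted) < 1e-9)
```

Shifting both positions by the same amount leaves the score unchanged up to floating-point error, because the two rotations compose into a single rotation by the position difference.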

Linear Transformer Topological Masking with Graph Random Features

Oct 04, 2024

Isaac Reid, Kumar Avinava Dubey, Deepali Jain, Will Whitney, Amr Ahmed, Joshua Ainslie, Alex Bewley, Mithun Jacob, Aranyak Mehta, David Rendleman(+5 more)

Abstract:When training transformers on graph-structured data, incorporating information about the underlying topology is crucial for good performance. Topological masking, a type of relative position encoding, achieves this by upweighting or downweighting attention depending on the relationship between the query and keys in a graph. In this paper, we propose to parameterise topological masks as a learnable function of a weighted adjacency matrix -- a novel, flexible approach which incorporates a strong structural inductive bias. By approximating this mask with graph random features (for which we prove the first known concentration bounds), we show how this can be made fully compatible with linear attention, preserving $\mathcal{O}(N)$ time and space complexity with respect to the number of input tokens. The fastest previous alternative was $\mathcal{O}(N \log N)$ and only suitable for specific graphs. Our efficient masking algorithms provide strong performance gains for tasks on image and point cloud data, including with $>30$k nodes.
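
As a rough illustration of why linear attention scales as $\mathcal{O}(N)$ in the number of tokens, the sketch below compares an explicit quadratic computation with the reordered linear one. It does not implement topological masking or graph random features; the `elu(x) + 1` feature map, the toy inputs, and all function names are assumptions made for this example.

```python
import math

def phi(x):
    """Positive feature map elu(x) + 1, applied elementwise."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quadratic_attention(Q, K, V):
    """Builds every query-key weight explicitly: O(N^2) in sequence length."""
    out = []
    for q in Q:
        weights = [dot(phi(q), phi(k)) for k in K]
        total = sum(weights)
        out.append([dot(weights, [v[j] for v in V]) / total
                    for j in range(len(V[0]))])
    return out

def linear_attention(Q, K, V):
    """Same result, but keys and values are summarised once: O(N)."""
    m, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(m)]  # running sum of phi(k) v^T
    z = [0.0] * m                       # running sum of phi(k)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(m):
            z[a] += fk[a]
            for j in range(dv):
                S[a][j] += fk[a] * v[j]
    out = []
    for q in Q:
        fq = phi(q)
        norm = dot(fq, z)
        out.append([dot(fq, [S[a][j] for a in range(m)]) / norm
                    for j in range(dv)])
    return out

# Hypothetical toy inputs: three tokens, two-dimensional queries, keys, values.
Q = [[0.2, -0.5], [1.0, 0.3], [-0.7, 0.8]]
K = [[0.1, 0.4], [-0.3, 0.9], [0.6, -0.2]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
slow = quadratic_attention(Q, K, V)
fast = linear_attention(Q, K, V)
```

Both functions return the same values up to rounding; the linear version never forms the N-by-N weight matrix, which is the property a mask has to preserve to stay compatible with linear attention.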

Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs

Jul 10, 2024

Hao-Tien Lewis Chiang, Zhuo Xu, Zipeng Fu, Mithun George Jacob, Tingnan Zhang, Tsang-Wei Edward Lee, Wenhao Yu, Connor Schenck, David Rendleman, Dhruv Shah(+12 more)

Abstract:An elusive goal in navigation research is to build an intelligent agent that can understand multimodal instructions, including natural language and images, and perform useful navigation. To achieve this, we study a widely useful category of navigation tasks we call Multimodal Instruction Navigation with demonstration Tours (MINT), in which the environment prior is provided through a previously recorded demonstration video. Recent advances in Vision Language Models (VLMs) have shown a promising path to achieving this goal, as they demonstrate capabilities in perceiving and reasoning about multimodal inputs. However, VLMs are typically trained to predict textual output, and how best to utilize them in navigation is an open research question. To solve MINT, we present Mobility VLA, a hierarchical Vision-Language-Action (VLA) navigation policy that combines the environment understanding and common sense reasoning power of long-context VLMs with a robust low-level navigation policy based on topological graphs. The high-level policy consists of a long-context VLM that takes the demonstration tour video and the multimodal user instruction as input to find the goal frame in the tour video. Next, a low-level policy uses the goal frame and an offline-constructed topological graph to generate robot actions at every timestep. We evaluate Mobility VLA in an 836 m^2 real-world environment and show that it achieves high end-to-end success rates on previously unsolved multimodal instructions such as "Where should I return this?" while holding a plastic bin.

SPNets: Differentiable Fluid Dynamics for Deep Neural Networks

Sep 26, 2018

Connor Schenck, Dieter Fox

Abstract:In this paper we introduce Smooth Particle Networks (SPNets), a framework for integrating fluid dynamics with deep networks. SPNets adds two new layers to the neural network toolbox: ConvSP and ConvSDF, which enable computing physical interactions with unordered particle sets. We use these layers in combination with standard neural network layers to directly implement fluid dynamics inside a deep network, where the parameters of the network are the fluid parameters themselves (e.g., viscosity, cohesion, etc.). Because SPNets are implemented as a neural network, the resulting fluid dynamics are fully differentiable. We then show how this can be successfully used to learn fluid parameters from data, perform liquid control tasks, and learn policies to manipulate liquids.

* Conference on Robot Learning (CoRL) 2018

Learning Robotic Manipulation of Granular Media

Oct 25, 2017

Connor Schenck, Jonathan Tompson, Dieter Fox, Sergey Levine

Abstract:In this paper, we examine the problem of robotic manipulation of granular media. We evaluate multiple predictive models used to infer the dynamics of scooping and dumping actions. These models are evaluated on a task that involves manipulating the media in order to deform it into a desired shape. Our best performing model is based on a highly tailored convolutional network architecture with domain-specific optimizations, which we show accurately models the physical interaction of the robotic scoop with the underlying media. We empirically demonstrate that explicitly predicting physical mechanics results in a policy that outperforms both a hand-crafted dynamics baseline and a "value-network", which must otherwise implicitly predict the same mechanics in order to produce accurate value estimates.

* Proceedings of the Conference on Robot Learning 2017 (CoRL) (to appear)

Guided Policy Search with Delayed Sensor Measurements

Sep 29, 2017

Connor Schenck, Dieter Fox

Abstract:Guided policy search is a method for reinforcement learning that trains a general policy for accomplishing a given task by guiding the learning of the policy with multiple guiding distributions. Guided policy search relies on learning an underlying dynamical model of the environment and then, at each iteration of the algorithm, using that model to gradually improve the policy. This model, though, often makes the assumption that the environment dynamics are Markovian, i.e., that they depend only on the current state and control signal. In this paper we apply guided policy search to a problem with non-Markovian dynamics. Specifically, we apply it to the problem of pouring a precise amount of liquid from a cup into a bowl, where many of the sensor measurements experience non-trivial amounts of delay. We show that, with relatively simple state augmentation, guided policy search can be extended to non-Markovian dynamical systems, where the non-Markovian behavior is caused by delayed sensor readings.

* 2016 Quals Report for Connor Schenck in the Department of Computer Science & Engineering at the University of Washington
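
The abstract above does not say which form of state augmentation is used. One common way to restore the Markov property under delayed readings is to stack a short history of observations and controls into the state, sketched below. The class name `HistoryAugmenter`, its parameters, and the zero-padding at start-up are all assumptions made for this illustration, not the report's method.

```python
from collections import deque

class HistoryAugmenter:
    """Builds an augmented state from the last `history` observation/action pairs."""

    def __init__(self, obs_dim, act_dim, history):
        # Pre-fill with zeros so the augmented state has a fixed length from step one.
        self.obs = deque([[0.0] * obs_dim for _ in range(history)], maxlen=history)
        self.act = deque([[0.0] * act_dim for _ in range(history)], maxlen=history)

    def step(self, observation, last_action):
        """Record the newest pair and return the flattened history, oldest first."""
        self.obs.append(list(observation))
        self.act.append(list(last_action))
        flat = []
        for o, a in zip(self.obs, self.act):
            flat += o + a
        return flat

# Hypothetical dimensions: 2 sensor values, 1 control, 3 steps of history.
aug = HistoryAugmenter(obs_dim=2, act_dim=1, history=3)
state = aug.step([1.0, 2.0], [0.5])
print(len(state))  # 3 steps * (2 + 1) values
```

If the history window is at least as long as the largest sensor delay, the augmented state contains the controls issued since the delayed reading was taken, which is what a dynamics model needs to predict the next reading.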

See the Glass Half Full: Reasoning about Liquid Containers, their Volume and Content

Sep 06, 2017

Roozbeh Mottaghi, Connor Schenck, Dieter Fox, Ali Farhadi

Abstract:Humans have a rich understanding of liquid containers and their contents; for example, we can effortlessly pour water from a pitcher to a cup. Doing so requires estimating the volume of the cup, approximating the amount of water in the pitcher, and predicting the behavior of water when we tilt the pitcher. Very little attention in computer vision has been paid to liquids and their containers. In this paper, we study liquid containers and their contents, and propose methods to estimate the volume of containers, approximate the amount of liquid in them, and perform comparative volume estimations, all from a single RGB image. Furthermore, we show the results of the proposed model for predicting the behavior of liquids inside containers when one tilts the containers. We also introduce a new dataset of Containers Of liQuid contEnt (COQE) that contains more than 5,000 images of 10,000 liquid containers in context labelled with volume, amount of content, bounding box annotation, and corresponding similar 3D CAD models.

Reasoning About Liquids via Closed-Loop Simulation

Jun 09, 2017

Connor Schenck, Dieter Fox

Abstract:Simulators are powerful tools for reasoning about a robot's interactions with its environment. However, when simulations diverge from reality, that reasoning becomes less useful. In this paper, we show how to close the loop between liquid simulation and real-time perception. We use observations of liquids to correct errors when tracking the liquid's state in a simulator. Our results show that closed-loop simulation is an effective way to prevent large divergence between the simulated and real liquid states. As a direct consequence of this, our method can enable reasoning about liquids that would otherwise be infeasible due to large divergences, such as reasoning about occluded liquid.

* Robotics: Science & Systems (RSS), July 12-16, 2017. Cambridge, MA, USA

Visual Closed-Loop Control for Pouring Liquids

Feb 25, 2017

Connor Schenck, Dieter Fox

Abstract:Pouring a specific amount of liquid is a challenging task. In this paper we develop methods for robots to use visual feedback to perform closed-loop control for pouring liquids. We propose both a model-based and a model-free method utilizing deep learning for estimating the volume of liquid in a container. Our results show that the model-free method is better able to estimate the volume. We combine this with a simple PID controller to pour specific amounts of liquid, and show that the robot is able to achieve an average 38ml deviation from the target amount. To our knowledge, this is the first use of raw visual feedback to pour liquids in robotics.

* To appear at ICRA 2017
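
To make the control loop in the abstract above concrete, here is a minimal PID sketch driving a toy pouring model. Everything beyond the PID form is an assumption for illustration: the gains, the 250 ml target, the time step, and the plant (flow rate equal to the non-negative control signal, with a perfect volume estimate standing in for the learned visual estimator). The integral gain is set to zero in the toy run because liquid cannot be un-poured, so overshoot cannot be corrected.

```python
class PID:
    """Textbook PID controller on the error between target and measurement."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, target, measured, dt):
        error = target - measured
        self.integral += error * dt
        # No derivative term on the first call, when there is no previous error.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Toy pouring simulation (hypothetical plant, not the robot from the paper).
target_ml, volume_ml, dt = 250.0, 0.0, 0.05
controller = PID(kp=1.5, ki=0.0, kd=0.05)
for _ in range(400):
    command = controller.update(target_ml, volume_ml, dt)
    flow = max(0.0, command)  # pouring only adds liquid
    volume_ml += flow * dt

print(round(volume_ml, 3))
```

In a real system the measured volume would come from the perception model and would be noisy, so the gains would need tuning against that noise rather than against this idealised plant.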

Towards Learning to Perceive and Reason About Liquids

Aug 02, 2016

Connor Schenck, Dieter Fox

Abstract:Recent advances in AI and robotics have claimed many incredible results with deep learning, yet no work to date has applied deep learning to the problem of liquid perception and reasoning. In this paper, we apply fully-convolutional deep neural networks to the tasks of detecting and tracking liquids. We evaluate three models: a single-frame network, a multi-frame network, and an LSTM recurrent network. Our results show that the best liquid detection results are achieved when aggregating data over multiple frames and that the LSTM network outperforms the other two in both tasks. This suggests that LSTM-based neural networks have the potential to be a key component for enabling robots to handle liquids using robust, closed-loop controllers.

* Published in International Symposium on Experimental Robotics (ISER) 2016. arXiv admin note: text overlap with arXiv:1606.06266
