Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Changhyun Choi

Performance Plateaus in Inference-Time Scaling for Text-to-Image Diffusion Without External Models

Jun 14, 2025

Changhyun Choi, Sungha Kim, H. Jin Kim

Abstract:Recently, it has been shown that investing computing resources in searching for good initial noise for a text-to-image diffusion model helps improve performance. However, previous studies required external models to evaluate the resulting images, which is impossible on GPUs with small VRAM. For these reasons, we apply Best-of-N inference-time scaling to algorithms that optimize the initial noise of a diffusion model without external models across multiple datasets and backbones. We demonstrate that inference-time scaling for text-to-image diffusion models in this setting quickly reaches a performance plateau, and a relatively small number of optimization steps suffices to achieve the maximum achievable performance with each algorithm.

* MOSS workshop at ICML 2025 accepted

Via

Access Paper or Ask Questions

Optimal Robotic Velcro Peeling with Force Feedback

Jun 06, 2025

Jiacheng Yuan, Changhyun Choi, Volkan Isler

Abstract:We study the problem of peeling a Velcro strap from a surface using a robotic manipulator. The surface geometry is arbitrary and unknown. The robot has access to only the force feedback and its end-effector position. This problem is challenging due to the partial observability of the environment and the incompleteness of the sensor feedback. To solve it, we first model the system with simple analytic state and action models based on quasi-static dynamics assumptions. We then study the fully-observable case where the state of both the Velcro and the robot are given. For this case, we obtain the optimal solution in closed-form which minimizes the total energy cost. Next, for the partially-observable case, we design a state estimator which estimates the underlying state using only force and position feedback. Then, we present a heuristics-based controller that balances exploratory and exploitative behaviors in order to peel the velcro efficiently. Finally, we evaluate our proposed method in environments with complex geometric uncertainties and sensor noises, achieving 100% success rate with less than 80% increase in energy cost compared to the optimal solution when the environment is fully-observable, outperforming the baselines by a large margin.

Via

Access Paper or Ask Questions

Attribute-Based Robotic Grasping with Data-Efficient Adaptation

Jan 04, 2025

Yang Yang, Houjian Yu, Xibai Lou, Yuanhao Liu, Changhyun Choi

Figure 1 for Attribute-Based Robotic Grasping with Data-Efficient Adaptation

Figure 2 for Attribute-Based Robotic Grasping with Data-Efficient Adaptation

Figure 3 for Attribute-Based Robotic Grasping with Data-Efficient Adaptation

Figure 4 for Attribute-Based Robotic Grasping with Data-Efficient Adaptation

Abstract:Robotic grasping is one of the most fundamental robotic manipulation tasks and has been the subject of extensive research. However, swiftly teaching a robot to grasp a novel target object in clutter remains challenging. This paper attempts to address the challenge by leveraging object attributes that facilitate recognition, grasping, and rapid adaptation to new domains. In this work, we present an end-to-end encoder-decoder network to learn attribute-based robotic grasping with data-efficient adaptation capability. We first pre-train the end-to-end model with a variety of basic objects to learn generic attribute representation for recognition and grasping. Our approach fuses the embeddings of a workspace image and a query text using a gated-attention mechanism and learns to predict instance grasping affordances. To train the joint embedding space of visual and textual attributes, the robot utilizes object persistence before and after grasping. Our model is self-supervised in a simulation that only uses basic objects of various colors and shapes but generalizes to novel objects in new environments. To further facilitate generalization, we propose two adaptation methods, adversarial adaption and one-grasp adaptation. Adversarial adaptation regulates the image encoder using augmented data of unlabeled images, whereas one-grasp adaptation updates the overall end-to-end model using augmented data from one grasp trial. Both adaptation methods are data-efficient and considerably improve instance grasping performance. Experimental results in both simulation and the real world demonstrate that our approach achieves over 81% instance grasping success rate on unknown objects, which outperforms several baselines by large margins.

* IEEE Transactions on Robotics, vol. 40, pp. 1566-1579, 2024
* Project page: https://z.umn.edu/attr-grasp. arXiv admin note: substantial text overlap with arXiv:2104.02271

Via

Access Paper or Ask Questions

Learning for Deformable Linear Object Insertion Leveraging Flexibility Estimation from Visual Cues

Oct 30, 2024

Mingen Li, Changhyun Choi

Figure 1 for Learning for Deformable Linear Object Insertion Leveraging Flexibility Estimation from Visual Cues

Figure 2 for Learning for Deformable Linear Object Insertion Leveraging Flexibility Estimation from Visual Cues

Figure 3 for Learning for Deformable Linear Object Insertion Leveraging Flexibility Estimation from Visual Cues

Figure 4 for Learning for Deformable Linear Object Insertion Leveraging Flexibility Estimation from Visual Cues

Abstract:Manipulation of deformable Linear objects (DLOs), including iron wire, rubber, silk, and nylon rope, is ubiquitous in daily life. These objects exhibit diverse physical properties, such as Young$'$s modulus and bending stiffness.Such diversity poses challenges for developing generalized manipulation policies. However, previous research limited their scope to single-material DLOs and engaged in time-consuming data collection for the state estimation. In this paper, we propose a two-stage manipulation approach consisting of a material property (e.g., flexibility) estimation and policy learning for DLO insertion with reinforcement learning. Firstly, we design a flexibility estimation scheme that characterizes the properties of different types of DLOs. The ground truth flexibility data is collected in simulation to train our flexibility estimation module. During the manipulation, the robot interacts with the DLOs to estimate flexibility by analyzing their visual configurations. Secondly, we train a policy conditioned on the estimated flexibility to perform challenging DLO insertion tasks. Our pipeline trained with diverse insertion scenarios achieves an 85.6% success rate in simulation and 66.67% in real robot experiments. Please refer to our project page: https://lmeee.github.io/DLOInsert/

* 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 2024, pp. 5183-5189
* 7 pages, 9 figures, 3 tables. 2024 IEEE International Conference on Robotics and Automation (ICRA)

Via

Access Paper or Ask Questions

A Parameter-Efficient Tuning Framework for Language-guided Object Grounding and Robot Grasping

Sep 28, 2024

Houjian Yu, Mingen Li, Alireza Rezazadeh, Yang Yang, Changhyun Choi

Figure 1 for A Parameter-Efficient Tuning Framework for Language-guided Object Grounding and Robot Grasping

Figure 2 for A Parameter-Efficient Tuning Framework for Language-guided Object Grounding and Robot Grasping

Figure 3 for A Parameter-Efficient Tuning Framework for Language-guided Object Grounding and Robot Grasping

Figure 4 for A Parameter-Efficient Tuning Framework for Language-guided Object Grounding and Robot Grasping

Abstract:The language-guided robot grasping task requires a robot agent to integrate multimodal information from both visual and linguistic inputs to predict actions for target-driven grasping. While recent approaches utilizing Multimodal Large Language Models (MLLMs) have shown promising results, their extensive computation and data demands limit the feasibility of local deployment and customization. To address this, we propose a novel CLIP-based multimodal parameter-efficient tuning (PET) framework designed for three language-guided object grounding and grasping tasks: (1) Referring Expression Segmentation (RES), (2) Referring Grasp Synthesis (RGS), and (3) Referring Grasp Affordance (RGA). Our approach introduces two key innovations: a bi-directional vision-language adapter that aligns multimodal inputs for pixel-level language understanding and a depth fusion branch that incorporates geometric cues to facilitate robot grasping predictions. Experiment results demonstrate superior performance in the RES object grounding task compared with existing CLIP-based full-model tuning or PET approaches. In the RGS and RGA tasks, our model not only effectively interprets object attributes based on simple language descriptions but also shows strong potential for comprehending complex spatial reasoning scenarios, such as multiple identical objects present in the workspace.

* This work has been submitted to ICRA 2025

Via

Access Paper or Ask Questions

SlotGNN: Unsupervised Discovery of Multi-Object Representations and Visual Dynamics

Oct 06, 2023

Alireza Rezazadeh, Athreyi Badithela, Karthik Desingh, Changhyun Choi

Figure 1 for SlotGNN: Unsupervised Discovery of Multi-Object Representations and Visual Dynamics

Figure 2 for SlotGNN: Unsupervised Discovery of Multi-Object Representations and Visual Dynamics

Figure 3 for SlotGNN: Unsupervised Discovery of Multi-Object Representations and Visual Dynamics

Figure 4 for SlotGNN: Unsupervised Discovery of Multi-Object Representations and Visual Dynamics

Abstract:Learning multi-object dynamics from visual data using unsupervised techniques is challenging due to the need for robust, object representations that can be learned through robot interactions. This paper presents a novel framework with two new architectures: SlotTransport for discovering object representations from RGB images and SlotGNN for predicting their collective dynamics from RGB images and robot interactions. Our SlotTransport architecture is based on slot attention for unsupervised object discovery and uses a feature transport mechanism to maintain temporal alignment in object-centric representations. This enables the discovery of slots that consistently reflect the composition of multi-object scenes. These slots robustly bind to distinct objects, even under heavy occlusion or absence. Our SlotGNN, a novel unsupervised graph-based dynamics model, predicts the future state of multi-object scenes. SlotGNN learns a graph representation of the scene using the discovered slots from SlotTransport and performs relational and spatial reasoning to predict the future appearance of each slot conditioned on robot actions. We demonstrate the effectiveness of SlotTransport in learning object-centric features that accurately encode both visual and positional information. Further, we highlight the accuracy of SlotGNN in downstream robotic tasks, including challenging multi-object rearrangement and long-horizon prediction. Finally, our unsupervised approach proves effective in the real world. With only minimal additional data, our framework robustly predicts slots and their corresponding dynamics in real-world control tasks.

Via

Access Paper or Ask Questions

Adversarial Object Rearrangement in Constrained Environments with Heterogeneous Graph Neural Networks

Sep 27, 2023

Xibai Lou, Houjian Yu, Ross Worobel, Yang Yang, Changhyun Choi

Abstract:Adversarial object rearrangement in the real world (e.g., previously unseen or oversized items in kitchens and stores) could benefit from understanding task scenes, which inherently entail heterogeneous components such as current objects, goal objects, and environmental constraints. The semantic relationships among these components are distinct from each other and crucial for multi-skilled robots to perform efficiently in everyday scenarios. We propose a hierarchical robotic manipulation system that learns the underlying relationships and maximizes the collaborative power of its diverse skills (e.g., pick-place, push) for rearranging adversarial objects in constrained environments. The high-level coordinator employs a heterogeneous graph neural network (HetGNN), which reasons about the current objects, goal objects, and environmental constraints; the low-level 3D Convolutional Neural Network-based actors execute the action primitives. Our approach is trained entirely in simulation, and achieved an average success rate of 87.88% and a planning cost of 12.82 in real-world experiments, surpassing all baseline methods. Supplementary material is available at https://sites.google.com/umn.edu/versatile-rearrangement.

* Accepted for publication in IROS 2023

Via

Access Paper or Ask Questions

IOSG: Image-driven Object Searching and Grasping

Aug 10, 2023

Houjian Yu, Xibai Lou, Yang Yang, Changhyun Choi

Figure 1 for IOSG: Image-driven Object Searching and Grasping

Figure 2 for IOSG: Image-driven Object Searching and Grasping

Figure 3 for IOSG: Image-driven Object Searching and Grasping

Figure 4 for IOSG: Image-driven Object Searching and Grasping

Abstract:When robots retrieve specific objects from cluttered scenes, such as home and warehouse environments, the target objects are often partially occluded or completely hidden. Robots are thus required to search, identify a target object, and successfully grasp it. Preceding works have relied on pre-trained object recognition or segmentation models to find the target object. However, such methods require laborious manual annotations to train the models and even fail to find novel target objects. In this paper, we propose an Image-driven Object Searching and Grasping (IOSG) approach where a robot is provided with the reference image of a novel target object and tasked to find and retrieve it. We design a Target Similarity Network that generates a probability map to infer the location of the novel target. IOSG learns a hierarchical policy; the high-level policy predicts the subtask type, whereas the low-level policies, explorer and coordinator, generate effective push and grasp actions. The explorer is responsible for searching the target object when it is hidden or occluded by other objects. Once the target object is found, the coordinator conducts target-oriented pushing and grasping to retrieve the target from the clutter. The proposed pipeline is trained with full self-supervision in simulation and applied to a real environment. Our model achieves a 96.0% and 94.5% task success rate on coordination and exploration tasks in simulation respectively, and 85.0% success rate on a real robot for the search-and-grasp task.

* Accepted to IEEE/RSJ International Conference on Intelligent Robots (IROS 2023). Project page: https://sites.google.com/umn.edu/iosg

Via

Access Paper or Ask Questions

Active Mass Distribution Estimation from Tactile Feedback

Mar 02, 2023

Jiacheng Yuan, Changhyun Choi, Ellad B. Tadmor, Volkan Isler

Abstract:In this work, we present a method to estimate the mass distribution of a rigid object through robotic interactions and tactile feedback. This is a challenging problem because of the complexity of physical dynamics modeling and the action dependencies across the model parameters. We propose a sequential estimation strategy combined with a set of robot action selection rules based on the analytical formulation of a discrete-time dynamics model. To evaluate the performance of our approach, we also manufactured re-configurable block objects that allow us to modify the object mass distribution while having access to the ground truth values. We compare our approach against multiple baselines and show that our approach can estimate the mass distribution with around 10% error, while the baselines have errors ranging from 18% to 68%.

Via

Access Paper or Ask Questions

Self-Supervised Interactive Object Segmentation Through a Singulation-and-Grasping Approach

Jul 20, 2022

Houjian Yu, Changhyun Choi

Figure 1 for Self-Supervised Interactive Object Segmentation Through a Singulation-and-Grasping Approach

Figure 2 for Self-Supervised Interactive Object Segmentation Through a Singulation-and-Grasping Approach

Figure 3 for Self-Supervised Interactive Object Segmentation Through a Singulation-and-Grasping Approach

Figure 4 for Self-Supervised Interactive Object Segmentation Through a Singulation-and-Grasping Approach

Abstract:Instance segmentation with unseen objects is a challenging problem in unstructured environments. To solve this problem, we propose a robot learning approach to actively interact with novel objects and collect each object's training label for further fine-tuning to improve the segmentation model performance, while avoiding the time-consuming process of manually labeling a dataset. The Singulation-and-Grasping (SaG) policy is trained through end-to-end reinforcement learning. Given a cluttered pile of objects, our approach chooses pushing and grasping motions to break the clutter and conducts object-agnostic grasping for which the SaG policy takes as input the visual observations and imperfect segmentation. We decompose the problem into three subtasks: (1) the object singulation subtask aims to separate the objects from each other, which creates more space that alleviates the difficulty of (2) the collision-free grasping subtask; (3) the mask generation subtask to obtain the self-labeled ground truth masks by using an optical flow-based binary classifier and motion cue post-processing for transfer learning. Our system achieves 70% singulation success rate in simulated cluttered scenes. The interactive segmentation of our system achieves 87.8%, 73.9%, and 69.3% average precision for toy blocks, YCB objects in simulation and real-world novel objects, respectively, which outperforms several baselines.

* Accepted to ECCV 2022

Via

Access Paper or Ask Questions