Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yuhang Hu

InstructEngine: Instruction-driven Text-to-Image Alignment

Apr 14, 2025

Xingyu Lu, Yuhang Hu, YiFan Zhang, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Jinpeng Wang, Bin Wen, Chun Yuan, Fan Yang(+2 more)

Abstract:Reinforcement Learning from Human/AI Feedback (RLHF/RLAIF) has been extensively utilized for preference alignment of text-to-image models. Existing methods face certain limitations in terms of both data and algorithm. For training data, most approaches rely on manual annotated preference data, either by directly fine-tuning the generators or by training reward models to provide training signals. However, the high annotation cost makes them difficult to scale up, the reward model consumes extra computation and cannot guarantee accuracy. From an algorithmic perspective, most methods neglect the value of text and only take the image feedback as a comparative signal, which is inefficient and sparse. To alleviate these drawbacks, we propose the InstructEngine framework. Regarding annotation cost, we first construct a taxonomy for text-to-image generation, then develop an automated data construction pipeline based on it. Leveraging advanced large multimodal models and human-defined rules, we generate 25K text-image preference pairs. Finally, we introduce cross-validation alignment method, which refines data efficiency by organizing semantically analogous samples into mutually comparable pairs. Evaluations on DrawBench demonstrate that InstructEngine improves SD v1.5 and SDXL's performance by 10.53% and 5.30%, outperforming state-of-the-art baselines, with ablation study confirming the benefits of InstructEngine's all components. A win rate of over 50% in human reviews also proves that InstructEngine better aligns with human preferences.

* 8 pages, 7 figures

Via

Access Paper or Ask Questions

Robot Metabolism: Towards machines that can grow by consuming other machines

Nov 17, 2024

Philippe Martin Wyder, Riyaan Bakhda, Meiqi Zhao, Quinn A. Booth, Matthew E. Modi, Andrew Song, Simon Kang, Jiahao Wu, Priya Patel, Robert T. Kasumi(+10 more)

Figure 1 for Robot Metabolism: Towards machines that can grow by consuming other machines

Figure 2 for Robot Metabolism: Towards machines that can grow by consuming other machines

Figure 3 for Robot Metabolism: Towards machines that can grow by consuming other machines

Figure 4 for Robot Metabolism: Towards machines that can grow by consuming other machines

Abstract:Biological lifeforms can heal, grow, adapt, and reproduce -- abilities essential for sustained survival and development. In contrast, robots today are primarily monolithic machines with limited ability to self-repair, physically develop, or incorporate material from their environments. A key challenge to such physical adaptation has been that while robot minds are rapidly evolving new behaviors through AI, their bodies remain closed systems, unable to systematically integrate new material to grow or heal. We argue that open-ended physical adaptation is only possible when robots are designed using only a small repertoire of simple modules. This allows machines to mechanically adapt by consuming parts from other machines or their surroundings and shedding broken components. We demonstrate this principle using a truss modular robot platform composed of one-dimensional actuated bars. We show how robots in this space can grow bigger, faster, and more capable by consuming materials from their environment and from other robots. We suggest that machine metabolic processes akin to the one demonstrated here will be an essential part of any sustained future robot ecology.

* Manuscript combined with Supplementary Materials File for arXiv submission. Submitting to Journal and will update external DOI once available

Via

Access Paper or Ask Questions

Kwai-STaR: Transform LLMs into State-Transition Reasoners

Nov 07, 2024

Xingyu Lu, Yuhang Hu, Changyi Liu, Tianke Zhang, Zhenyu Yang, Zhixiang Ding, Shengsheng Qian, Meng Du, Ruiwen Kang, Kaiyu Tang(+5 more)

Abstract:Mathematical reasoning presents a significant challenge to the cognitive capabilities of LLMs. Various methods have been proposed to enhance the mathematical ability of LLMs. However, few recognize the value of state transition for LLM reasoning. In this work, we define mathematical problem-solving as a process of transiting from an initial unsolved state to the final resolved state, and propose Kwai-STaR framework, which transforms LLMs into State-Transition Reasoners to improve their intuitive reasoning capabilities. Our approach comprises three main steps: (1) Define the state space tailored to the mathematical reasoning. (2) Generate state-transition data based on the state space. (3) Convert original LLMs into State-Transition Reasoners via a curricular training strategy. Our experiments validate the effectiveness of Kwai-STaR in enhancing mathematical reasoning: After training on the small-scale Kwai-STaR dataset, general LLMs, including Mistral-7B and LLaMA-3, achieve considerable performance gain on the GSM8K and GSM-Hard dataset. Additionally, the state transition-based design endows Kwai-STaR with remarkable training and inference efficiency. Further experiments are underway to establish the generality of Kwai-STaR.

* 6 pages, 2 figures

Via

Access Paper or Ask Questions

NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

Apr 17, 2024

Xin Li, Kun Yuan, Yajing Pei, Yiting Lu, Ming Sun, Chao Zhou, Zhibo Chen, Radu Timofte, Wei Sun, Haoning Wu(+58 more)

Figure 1 for NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

Figure 2 for NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

Figure 3 for NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

Figure 4 for NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

Abstract:This paper reviews the NTIRE 2024 Challenge on Shortform UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions are submitted and evaluated on the collected dataset KVQ from popular short-form video platform, i.e., Kuaishou/Kwai Platform. The KVQ database is divided into three parts, including 2926 videos for training, 420 videos for validation, and 854 videos for testing. The purpose is to build new benchmarks and advance the development of S-UGC VQA. The competition had 200 participants and 13 teams submitted valid solutions for the final testing phase. The proposed solutions achieved state-of-the-art performances for S-UGC VQA. The project can be found at https://github.com/lixinustc/KVQChallenge-CVPR-NTIRE2024.

* Accepted by CVPR2024 Workshop. The challenge report for CVPR NTIRE2024 Short-form UGC Video Quality Assessment Challenge

Via

Access Paper or Ask Questions

Reconfigurable Robot Identification from Motion Data

Mar 15, 2024

Yuhang Hu, Yunzhe Wang, Ruibo Liu, Zhou Shen, Hod Lipson

Abstract:Integrating Large Language Models (VLMs) and Vision-Language Models (VLMs) with robotic systems enables robots to process and understand complex natural language instructions and visual information. However, a fundamental challenge remains: for robots to fully capitalize on these advancements, they must have a deep understanding of their physical embodiment. The gap between AI models cognitive capabilities and the understanding of physical embodiment leads to the following question: Can a robot autonomously understand and adapt to its physical form and functionalities through interaction with its environment? This question underscores the transition towards developing self-modeling robots without reliance on external sensory or pre-programmed knowledge about their structure. Here, we propose a meta self modeling that can deduce robot morphology through proprioception (the internal sense of position and movement). Our study introduces a 12 DoF reconfigurable legged robot, accompanied by a diverse dataset of 200k unique configurations, to systematically investigate the relationship between robotic motion and robot morphology. Utilizing a deep neural network model comprising a robot signature encoder and a configuration decoder, we demonstrate the capability of our system to accurately predict robot configurations from proprioceptive signals. This research contributes to the field of robotic self-modeling, aiming to enhance understanding of their physical embodiment and adaptability in real world scenarios.

Via

Access Paper or Ask Questions

Teaching Robots to Build Simulations of Themselves

Nov 20, 2023

Yuhang Hu, Jiong Lin, Hod Lipson

Abstract:Simulation enables robots to plan and estimate the outcomes of prospective actions without the need to physically execute them. We introduce a self-supervised learning framework to enable robots model and predict their morphology, kinematics and motor control using only brief raw video data, eliminating the need for extensive real-world data collection and kinematic priors. By observing their own movements, akin to humans watching their reflection in a mirror, robots learn an ability to simulate themselves and predict their spatial motion for various tasks. Our results demonstrate that this self-learned simulation not only enables accurate motion planning but also allows the robot to detect abnormalities and recover from damage.

Via

Access Paper or Ask Questions

Knolling bot 2.0: Enhancing Object Organization with Self-supervised Graspability Estimation

Oct 30, 2023

Yuhang Hu, Zhizhuo Zhang, Hod Lipson

Abstract:Building on recent advancements in transformer based approaches for domestic robots performing knolling, the art of organizing scattered items into neat arrangements. This paper introduces Knolling bot 2.0. Recognizing the challenges posed by piles of objects or items situated closely together, this upgraded system incorporates a self-supervised graspability estimation model. If objects are deemed ungraspable, an additional behavior will be executed to separate the objects before knolling the table. By integrating this grasp prediction mechanism with existing visual perception and transformer based knolling models, an advanced system capable of decluttering and organizing even more complex and densely populated table settings is demonstrated. Experimental evaluations demonstrate the effectiveness of this module, yielding a graspability prediction accuracy of 95.7%.

* This paper has been accepted by the NeurIPS 2023 Robot Learning Workshop

Via

Access Paper or Ask Questions

Knolling bot: A Transformer-based Approach to Organizing a Messy Table

Oct 06, 2023

Yuhang Hu, Zhizhuo Zhang, Ruibo Liu, Philippe Wyder, Hod Lipson

Abstract:In this study, we propose an approach to equip domestic robots with the ability to perform simple household tidying tasks. We focus specifically on 'knolling,' an activity related to organizing scattered items into neat and space-efficient arrangements. Unlike the uniformity of industrial environments, household settings present unique challenges due to their diverse array of items and the subjectivity of tidiness. Here, we draw inspiration from natural language processing (NLP) and utilize a transformer-based approach that predicts the next position of an item in a sequence of neatly positioned items. We integrate the knolling model with a visual perception model and a physical robot arm to demonstrate a machine that declutters and organizes a dozen freeform items of various shapes and sizes.

* This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Via

Access Paper or Ask Questions

On the Origins of Self-Modeling

Sep 05, 2022

Robert Kwiatkowski, Yuhang Hu, Boyuan Chen, Hod Lipson

Figure 1 for On the Origins of Self-Modeling

Figure 2 for On the Origins of Self-Modeling

Figure 3 for On the Origins of Self-Modeling

Figure 4 for On the Origins of Self-Modeling

Abstract:Self-Modeling is the process by which an agent, such as an animal or machine, learns to create a predictive model of its own dynamics. Once captured, this self-model can then allow the agent to plan and evaluate various potential behaviors internally using the self-model, rather than using costly physical experimentation. Here, we quantify the benefits of such self-modeling against the complexity of the robot. We find a R2 =0.90 correlation between the number of degrees of freedom a robot has, and the added value of self-modeling as compared to a direct learning baseline. This result may help motivate self modeling in increasingly complex robotic systems, as well as shed light on the origins of self-modeling, and ultimately self-awareness, in animals and humans.

* arXiv admin note: substantial text overlap with arXiv:1910.01994

Via

Access Paper or Ask Questions

Egocentric Visual Self-Modeling for Legged Robot Locomotion

Jul 07, 2022

Yuhang Hu, Boyuan Chen, Hod Lipson

Figure 1 for Egocentric Visual Self-Modeling for Legged Robot Locomotion

Figure 2 for Egocentric Visual Self-Modeling for Legged Robot Locomotion

Figure 3 for Egocentric Visual Self-Modeling for Legged Robot Locomotion

Figure 4 for Egocentric Visual Self-Modeling for Legged Robot Locomotion

Abstract:Inertial Measurement Unit (IMU) is ubiquitous in robotic research. It provides posture information for robots to realize balance and navigation. However, humans and animals can perceive the movement of their bodies in the environment without precise orientation or position values. This interaction inherently involves a fast feedback loop between perception and action. This work proposed an end-to-end approach that uses high dimension visual observation and action commands to train a visual self-model for legged locomotion. The visual self-model learns the spatial relationship between the robot body movement and the ground texture changes from image sequences. We demonstrate that the robot can leverage the visual self-model to achieve various locomotion tasks in the real-world environment that the robot does not see during training. With our proposed method, robots can do locomotion without IMU or in an environment with no GPS or weak geomagnetic fields like the indoor and urban canyons in the city.

Via

Access Paper or Ask Questions