Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Atsushi Yamashita

SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation

Jan 16, 2025

Wanqi Yin, Zhongang Cai, Ruisi Wang, Ailing Zeng, Chen Wei, Qingping Sun, Haiyi Mei, Yanjun Wang, Hui En Pang, Mingyuan Zhang(+5 more)

Abstract:Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods focus on training innovative architectural designs on confined datasets. In this work, we investigate the impact of scaling up EHPS towards a family of generalist foundation models. 1) For data scaling, we perform a systematic investigation on 40 EHPS datasets, encompassing a wide range of scenarios that a model trained on any single dataset cannot handle. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. Ultimately, we achieve diminishing returns at 10M training instances from diverse data sources. 2) For model scaling, we take advantage of vision transformers (up to ViT-Huge as the backbone) to study the scaling law of model sizes in EHPS. To exclude the influence of algorithmic design, we base our experiments on two minimalist architectures: SMPLer-X, which consists of an intermediate step for hand and face localization, and SMPLest-X, an even simpler version that reduces the network to its bare essentials and highlights significant advances in the capture of articulated hands. With big data and the large model, the foundation models exhibit strong performance across diverse test benchmarks and excellent transferability to even unseen environments. Moreover, our finetuning strategy turns the generalist into specialist models, allowing them to achieve further performance boosts. Notably, our foundation models consistently deliver state-of-the-art results on seven benchmarks such as AGORA, UBody, EgoBody, and our proposed SynHand dataset for comprehensive hand evaluation. (Code is available at: https://github.com/wqyin/SMPLest-X).

* An extension of SMPLer-X [arXiv:2309.17448]. Homepage: https://caizhongang.com/projects/SMPLer-X/

Via

Access Paper or Ask Questions

DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models

Nov 13, 2024

Yongdong Wang, Runze Xiao, Jun Younes Louhi Kasahara, Ryosuke Yajima, Keiji Nagatani, Atsushi Yamashita, Hajime Asama

Figure 1 for DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models

Figure 2 for DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models

Figure 3 for DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models

Figure 4 for DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models

Abstract:Large Language Models (LLMs) have demonstrated significant reasoning capabilities in robotic systems. However, their deployment in multi-robot systems remains fragmented and struggles to handle complex task dependencies and parallel execution. This study introduces the DART-LLM (Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models) system, designed to address these challenges. DART-LLM utilizes LLMs to parse natural language instructions, decomposing them into multiple subtasks with dependencies to establish complex task sequences, thereby enhancing efficient coordination and parallel execution in multi-robot systems. The system includes the QA LLM module, Breakdown Function modules, Actuation module, and a Vision-Language Model (VLM)-based object detection module, enabling task decomposition and execution from natural language instructions to robotic actions. Experimental results demonstrate that DART-LLM excels in handling long-horizon tasks and collaborative tasks with complex dependencies. Even when using smaller models like Llama 3.1 8B, the system achieves good performance, highlighting DART-LLM's robustness in terms of model size. Please refer to the project website \url{https://wyd0817.github.io/project-dart-llm/} for videos and code.

* Submitted to the 2025 IEEE International Conference on Robotics & Automation on September 15, 2024

Via

Access Paper or Ask Questions

State-Free Inference of State-Space Models: The Transfer Function Approach

May 10, 2024

Rom N. Parnichkun, Stefano Massaroli, Alessandro Moro, Jimmy T. H. Smith, Ramin Hasani, Mathias Lechner, Qi An, Christopher Ré, Hajime Asama, Stefano Ermon(+3 more)

Abstract:We approach designing a state-space model for deep learning applications through its dual representation, the transfer function, and uncover a highly efficient sequence parallel inference algorithm that is state-free: unlike other proposed algorithms, state-free inference does not incur any significant memory or computational cost with an increase in state size. We achieve this using properties of the proposed frequency domain transfer function parametrization, which enables direct computation of its corresponding convolutional kernel's spectrum via a single Fast Fourier Transform. Our experimental results across multiple sequence lengths and state sizes illustrates, on average, a 35% training speed improvement over S4 layers -- parametrized in time-domain -- on the Long Range Arena benchmark, while delivering state-of-the-art downstream performances over other attention-free approaches. Moreover, we report improved perplexity in language modeling over a long convolutional Hyena baseline, by simply introducing our transfer function parametrization. Our code is available at https://github.com/ruke1ire/RTF.

Via

Access Paper or Ask Questions

WHAC: World-grounded Humans and Cameras

Mar 19, 2024

Wanqi Yin, Zhongang Cai, Ruisi Wang, Fanzhou Wang, Chen Wei, Haiyi Mei, Weiye Xiao, Zhitao Yang, Qingping Sun, Atsushi Yamashita(+2 more)

Figure 1 for WHAC: World-grounded Humans and Cameras

Figure 2 for WHAC: World-grounded Humans and Cameras

Figure 3 for WHAC: World-grounded Humans and Cameras

Figure 4 for WHAC: World-grounded Humans and Cameras

Abstract:Estimating human and camera trajectories with accurate scale in the world coordinate system from a monocular video is a highly desirable yet challenging and ill-posed problem. In this study, we aim to recover expressive parametric human models (i.e., SMPL-X) and corresponding camera poses jointly, by leveraging the synergy between three critical players: the world, the human, and the camera. Our approach is founded on two key observations. Firstly, camera-frame SMPL-X estimation methods readily recover absolute human depth. Secondly, human motions inherently provide absolute spatial cues. By integrating these insights, we introduce a novel framework, referred to as WHAC, to facilitate world-grounded expressive human pose and shape estimation (EHPS) alongside camera pose estimation, without relying on traditional optimization techniques. Additionally, we present a new synthetic dataset, WHAC-A-Mole, which includes accurately annotated humans and cameras, and features diverse interactive human motions as well as realistic camera trajectories. Extensive experiments on both standard and newly established benchmarks highlight the superiority and efficacy of our framework. We will make the code and dataset publicly available.

* Homepage: https://wqyin.github.io/projects/WHAC/

Via

Access Paper or Ask Questions

Motion Degeneracy in Self-supervised Learning of Elevation Angle Estimation for 2D Forward-Looking Sonar

Aug 01, 2023

Yusheng Wang, Yonghoon Ji, Chujie Wu, Hiroshi Tsuchiya, Hajime Asama, Atsushi Yamashita

Abstract:2D forward-looking sonar is a crucial sensor for underwater robotic perception. A well-known problem in this field is estimating missing information in the elevation direction during sonar imaging. There are demands to estimate 3D information per image for 3D mapping and robot navigation during fly-through missions. Recent learning-based methods have demonstrated their strengths, but there are still drawbacks. Supervised learning methods have achieved high-quality results but may require further efforts to acquire 3D ground-truth labels. The existing self-supervised method requires pretraining using synthetic images with 3D supervision. This study aims to realize stable self-supervised learning of elevation angle estimation without pretraining using synthetic images. Failures during self-supervised learning may be caused by motion degeneracy problems. We first analyze the motion field of 2D forward-looking sonar, which is related to the main supervision signal. We utilize a modern learning framework and prove that if the training dataset is built with effective motions, the network can be trained in a self-supervised manner without the knowledge of synthetic data. Both simulation and real experiments validate the proposed method.

* IROS2023

Via

Access Paper or Ask Questions

2D Forward Looking Sonar Simulation with Ground Echo Modeling

Apr 17, 2023

Yusheng Wang, Chujie Wu, Yonghoon Ji, Hiroshi Tsuchiya, Hajime Asama, Atsushi Yamashita

Abstract:Imaging sonar produces clear images in underwater environments, independent of water turbidity and lighting conditions. The next generation 2D forward looking sonars are compact in size and able to generate high-resolution images which facilitate underwater robotics research. Considering the difficulties and expenses of implementing experiments in underwater environments, tremendous work has been focused on sonar image simulation. However, sonar artifacts like multi-path reflection were not sufficiently discussed, which cannot be ignored in water tank environments. In this paper, we focus on the influence of echoes from the flat ground. We propose a method to simulate the ground echo effect physically in acoustic images. We model the multi-bounce situations using the single-bounce framework for computation efficiency. We compare the real image captured in the water tank with the synthetic images to validate the proposed methods.

Via

Access Paper or Ask Questions

Learning Pseudo Front Depth for 2D Forward-Looking Sonar-based Multi-view Stereo

Jul 30, 2022

Yusheng Wang, Yonghoon Ji, Hiroshi Tsuchiya, Hajime Asama, Atsushi Yamashita

Figure 1 for Learning Pseudo Front Depth for 2D Forward-Looking Sonar-based Multi-view Stereo

Figure 2 for Learning Pseudo Front Depth for 2D Forward-Looking Sonar-based Multi-view Stereo

Figure 3 for Learning Pseudo Front Depth for 2D Forward-Looking Sonar-based Multi-view Stereo

Figure 4 for Learning Pseudo Front Depth for 2D Forward-Looking Sonar-based Multi-view Stereo

Abstract:Retrieving the missing dimension information in acoustic images from 2D forward-looking sonar is a well-known problem in the field of underwater robotics. There are works attempting to retrieve 3D information from a single image which allows the robot to generate 3D maps with fly-through motion. However, owing to the unique image formulation principle, estimating 3D information from a single image faces severe ambiguity problems. Classical methods of multi-view stereo can avoid the ambiguity problems, but may require a large number of viewpoints to generate an accurate model. In this work, we propose a novel learning-based multi-view stereo method to estimate 3D information. To better utilize the information from multiple frames, an elevation plane sweeping method is proposed to generate the depth-azimuth-elevation cost volume. The volume after regularization can be considered as a probabilistic volumetric representation of the target. Instead of performing regression on the elevation angles, we use pseudo front depth from the cost volume to represent the 3D information which can avoid the 2D-3D problem in acoustic imaging. High-accuracy results can be generated with only two or three images. Synthetic datasets were generated to simulate various underwater targets. We also built the first real dataset with accurate ground truth in a large scale water tank. Experimental results demonstrate the superiority of our method, compared to other state-of-the-art methods.

* Accepted at IROS 2022

Via

Access Paper or Ask Questions

Efficient Video Deblurring Guided by Motion Magnitude

Jul 27, 2022

Yusheng Wang, Yunfan Lu, Ye Gao, Lin Wang, Zhihang Zhong, Yinqiang Zheng, Atsushi Yamashita

Figure 1 for Efficient Video Deblurring Guided by Motion Magnitude

Figure 2 for Efficient Video Deblurring Guided by Motion Magnitude

Figure 3 for Efficient Video Deblurring Guided by Motion Magnitude

Figure 4 for Efficient Video Deblurring Guided by Motion Magnitude

Abstract:Video deblurring is a highly under-constrained problem due to the spatially and temporally varying blur. An intuitive approach for video deblurring includes two steps: a) detecting the blurry region in the current frame; b) utilizing the information from clear regions in adjacent frames for current frame deblurring. To realize this process, our idea is to detect the pixel-wise blur level of each frame and combine it with video deblurring. To this end, we propose a novel framework that utilizes the motion magnitude prior (MMP) as guidance for efficient deep video deblurring. Specifically, as the pixel movement along its trajectory during the exposure time is positively correlated to the level of motion blur, we first use the average magnitude of optical flow from the high-frequency sharp frames to generate the synthetic blurry frames and their corresponding pixel-wise motion magnitude maps. We then build a dataset including the blurry frame and MMP pairs. The MMP is then learned by a compact CNN by regression. The MMP consists of both spatial and temporal blur level information, which can be further integrated into an efficient recurrent neural network (RNN) for video deblurring. We conduct intensive experiments to validate the effectiveness of the proposed methods on the public datasets.

* Accepted at ECCV2022

Via

Access Paper or Ask Questions

ReIL: A Framework for Reinforced Intervention-based Imitation Learning

Mar 29, 2022

Rom Parnichkun, Matthew N. Dailey, Atsushi Yamashita

Figure 1 for ReIL: A Framework for Reinforced Intervention-based Imitation Learning

Figure 2 for ReIL: A Framework for Reinforced Intervention-based Imitation Learning

Figure 3 for ReIL: A Framework for Reinforced Intervention-based Imitation Learning

Figure 4 for ReIL: A Framework for Reinforced Intervention-based Imitation Learning

Abstract:Compared to traditional imitation learning methods such as DAgger and DART, intervention-based imitation offers a more convenient and sample efficient data collection process to users. In this paper, we introduce Reinforced Intervention-based Learning (ReIL), a framework consisting of a general intervention-based learning algorithm and a multi-task imitation learning model aimed at enabling non-expert users to train agents in real environments with little supervision or fine tuning. ReIL achieves this with an algorithm that combines the advantages of imitation learning and reinforcement learning and a model capable of concurrently processing demonstrations, past experience, and current observations. Experimental results from real world mobile robot navigation challenges indicate that ReIL learns rapidly from sparse supervisor corrections without suffering deterioration in performance that is characteristic of supervised learning-based methods such as HG-Dagger and IWR. The results also demonstrate that in contrast to other intervention-based methods such as IARL and EGPO, ReIL can utilize an arbitrary reward function for training without any additional heuristics.

Via

Access Paper or Ask Questions

Continuous-Depth Neural Models for Dynamic Graph Prediction

Jun 22, 2021

Michael Poli, Stefano Massaroli, Clayton M. Rabideau, Junyoung Park, Atsushi Yamashita, Hajime Asama, Jinkyoo Park

Figure 1 for Continuous-Depth Neural Models for Dynamic Graph Prediction

Figure 2 for Continuous-Depth Neural Models for Dynamic Graph Prediction

Figure 3 for Continuous-Depth Neural Models for Dynamic Graph Prediction

Figure 4 for Continuous-Depth Neural Models for Dynamic Graph Prediction

Abstract:We introduce the framework of continuous-depth graph neural networks (GNNs). Neural graph differential equations (Neural GDEs) are formalized as the counterpart to GNNs where the input-output relationship is determined by a continuum of GNN layers, blending discrete topological structures and differential equations. The proposed framework is shown to be compatible with static GNN models and is extended to dynamic and stochastic settings through hybrid dynamical system theory. Here, Neural GDEs improve performance by exploiting the underlying dynamics geometry, further introducing the ability to accommodate irregularly sampled data. Results prove the effectiveness of the proposed models across applications, such as traffic forecasting or prediction in genetic regulatory networks.

* Extended version of the workshop paper "Graph Neural Ordinary Differential Equations". arXiv admin note: substantial text overlap with arXiv:1911.07532

Via

Access Paper or Ask Questions