Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zixuan Huang

AnoF-Diff: One-Step Diffusion-Based Anomaly Detection for Forceful Tool Use

Sep 18, 2025

Yating Lin, Zixuan Huang, Fan Yang, Dmitry Berenson

Abstract:Multivariate time-series anomaly detection, which is critical for identifying unexpected events, has been explored in the field of machine learning for several decades. However, directly applying these methods to data from forceful tool use tasks is challenging because streaming sensor data in the real world tends to be inherently noisy, exhibits non-stationary behavior, and varies across different tasks and tools. To address these challenges, we propose a method, AnoF-Diff, based on the diffusion model to extract force-torque features from time-series data and use force-torque features to detect anomalies. We compare our method with other state-of-the-art methods in terms of F1-score and Area Under the Receiver Operating Characteristic curve (AUROC) on four forceful tool-use tasks, demonstrating that our method has better performance and is more robust to a noisy dataset. We also propose the method of parallel anomaly score evaluation based on one-step diffusion and demonstrate how our method can be used for online anomaly detection in several forceful tool use experiments.

Via

Access Paper or Ask Questions

Planning-Query-Guided Model Generation for Model-Based Deformable Object Manipulation

Aug 26, 2025

Alex LaGrassa, Zixuan Huang, Dmitry Berenson, Oliver Kroemer

Abstract:Efficient planning in high-dimensional spaces, such as those involving deformable objects, requires computationally tractable yet sufficiently expressive dynamics models. This paper introduces a method that automatically generates task-specific, spatially adaptive dynamics models by learning which regions of the object require high-resolution modeling to achieve good task performance for a given planning query. Task performance depends on the complex interplay between the dynamics model, world dynamics, control, and task requirements. Our proposed diffusion-based model generator predicts per-region model resolutions based on start and goal pointclouds that define the planning query. To efficiently collect the data for learning this mapping, a two-stage process optimizes resolution using predictive dynamics as a prior before directly optimizing using closed-loop performance. On a tree-manipulation task, our method doubles planning speed with only a small decrease in task performance over using a full-resolution model. This approach informs a path towards using previous planning and control data to generate computationally efficient yet sufficiently expressive dynamics models for new tasks.

* 9 pages, 7 figures

Via

Access Paper or Ask Questions

MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models

May 26, 2025

Anh Thai, Stefan Stojanov, Zixuan Huang, Bikram Boote, James M. Rehg

Abstract:This paper introduces MEBench, a novel benchmark for evaluating mutual exclusivity (ME) bias, a cognitive phenomenon observed in children during word learning. Unlike traditional ME tasks, MEBench further incorporates spatial reasoning to create more challenging and realistic evaluation settings. We assess the performance of state-of-the-art vision-language models (VLMs) on this benchmark using novel evaluation metrics that capture key aspects of ME-based reasoning. To facilitate controlled experimentation, we also present a flexible and scalable data generation pipeline that supports the construction of diverse annotated scenes.

Via

Access Paper or Ask Questions

An ACO-MPC Framework for Energy-Efficient and Collision-Free Path Planning in Autonomous Maritime Navigation

Apr 22, 2025

Yaoze Liu, Zhen Tian, Qifan Zhou, Zixuan Huang, Hongyu Sun

Abstract:Automated driving on ramps presents significant challenges due to the need to balance both safety and efficiency during lane changes. This paper proposes an integrated planner for automated vehicles (AVs) on ramps, utilizing an unsatisfactory level metric for efficiency and arrow-cluster-based sampling for safety. The planner identifies optimal times for the AV to change lanes, taking into account the vehicle's velocity as a key factor in efficiency. Additionally, the integrated planner employs arrow-cluster-based sampling to evaluate collision risks and select an optimal lane-changing curve. Extensive simulations were conducted in a ramp scenario to verify the planner's efficient and safe performance. The results demonstrate that the proposed planner can effectively select an appropriate lane-changing time point and a safe lane-changing curve for AVs, without incurring any collisions during the maneuver.

* This paper has been accepted by the 2025 8th International Conference on Advanced Algorithms and Control Engineering (ICAACE 2025)

Via

Access Paper or Ask Questions

SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images

Jan 08, 2025

Zixuan Huang, Mark Boss, Aaryaman Vasishta, James M. Rehg, Varun Jampani

Figure 1 for SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images

Figure 2 for SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images

Figure 3 for SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images

Figure 4 for SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images

Abstract:We study the problem of single-image 3D object reconstruction. Recent works have diverged into two directions: regression-based modeling and generative modeling. Regression methods efficiently infer visible surfaces, but struggle with occluded regions. Generative methods handle uncertain regions better by modeling distributions, but are computationally expensive and the generation is often misaligned with visible surfaces. In this paper, we present SPAR3D, a novel two-stage approach aiming to take the best of both directions. The first stage of SPAR3D generates sparse 3D point clouds using a lightweight point diffusion model, which has a fast sampling speed. The second stage uses both the sampled point cloud and the input image to create highly detailed meshes. Our two-stage design enables probabilistic modeling of the ill-posed single-image 3D task while maintaining high computational efficiency and great output fidelity. Using point clouds as an intermediate representation further allows for interactive user edits. Evaluated on diverse datasets, SPAR3D demonstrates superior performance over previous state-of-the-art methods, at an inference speed of 0.7 seconds. Project page with code and model: https://spar3d.github.io

Via

Access Paper or Ask Questions

Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation

Nov 26, 2024

Xiang Li, Zixuan Huang, Anh Thai, James M. Rehg

Abstract:Symmetry is a ubiquitous and fundamental property in the visual world, serving as a critical cue for perception and structure interpretation. This paper investigates the detection of 3D reflection symmetry from a single RGB image, and reveals its significant benefit on single-image 3D generation. We introduce Reflect3D, a scalable, zero-shot symmetry detector capable of robust generalization to diverse and real-world scenarios. Inspired by the success of foundation models, our method scales up symmetry detection with a transformer-based architecture. We also leverage generative priors from multi-view diffusion models to address the inherent ambiguity in single-view symmetry detection. Extensive evaluations on various data sources demonstrate that Reflect3D establishes a new state-of-the-art in single-image symmetry detection. Furthermore, we show the practical benefit of incorporating detected symmetry into single-image 3D generation pipelines through a symmetry-aware optimization process. The integration of symmetry significantly enhances the structural accuracy, cohesiveness, and visual fidelity of the reconstructed 3D geometry and textures, advancing the capabilities of 3D content creation.

* Project page: https://ryanxli.github.io/reflect3d/

Via

Access Paper or Ask Questions

MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective

Nov 21, 2024

Hailang Huang, Yong Wang, Zixuan Huang, Huaqiu Li, Tongwen Huang, Xiangxiang Chu, Richong Zhang

Figure 1 for MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective

Figure 2 for MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective

Figure 3 for MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective

Figure 4 for MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective

Abstract:Large Multimodal Models (LMMs) have demonstrated remarkable capabilities. While existing benchmarks for evaluating LMMs mainly focus on image comprehension, few works evaluate them from the image generation perspective. To address this issue, we propose a straightforward automated evaluation pipeline. Specifically, this pipeline requires LMMs to generate an image-prompt from a given input image. Subsequently, it employs text-to-image generative models to create a new image based on these generated prompts. Finally, we evaluate the performance of LMMs by comparing the original image with the generated one. Furthermore, we introduce MMGenBench-Test, a comprehensive benchmark developed to evaluate LMMs across 13 distinct image patterns, and MMGenBench-Domain, targeting the performance evaluation of LMMs within the generative image domain. A thorough evaluation involving over 50 popular LMMs demonstrates the effectiveness and reliability in both the pipeline and benchmark. Our observations indicate that numerous LMMs excelling in existing benchmarks fail to adequately complete the basic tasks, related to image understanding and description. This finding highlights the substantial potential for performance improvement in current LMMs and suggests avenues for future model optimization. Concurrently, our pipeline facilitates the efficient assessment of LMMs performance across diverse domains by using solely image inputs.

* This project is available at: https://github.com/lerogo/MMGenBench

Via

Access Paper or Ask Questions

Implicit Contact Diffuser: Sequential Contact Reasoning with Latent Point Cloud Diffusion

Oct 21, 2024

Zixuan Huang, Yinong He, Yating Lin, Dmitry Berenson

Figure 1 for Implicit Contact Diffuser: Sequential Contact Reasoning with Latent Point Cloud Diffusion

Figure 2 for Implicit Contact Diffuser: Sequential Contact Reasoning with Latent Point Cloud Diffusion

Figure 3 for Implicit Contact Diffuser: Sequential Contact Reasoning with Latent Point Cloud Diffusion

Figure 4 for Implicit Contact Diffuser: Sequential Contact Reasoning with Latent Point Cloud Diffusion

Abstract:Long-horizon contact-rich manipulation has long been a challenging problem, as it requires reasoning over both discrete contact modes and continuous object motion. We introduce Implicit Contact Diffuser (ICD), a diffusion-based model that generates a sequence of neural descriptors that specify a series of contact relationships between the object and the environment. This sequence is then used as guidance for an MPC method to accomplish a given task. The key advantage of this approach is that the latent descriptors provide more task-relevant guidance to MPC, helping to avoid local minima for contact-rich manipulation tasks. Our experiments demonstrate that ICD outperforms baselines on complex, long-horizon, contact-rich manipulation tasks, such as cable routing and notebook folding. Additionally, our experiments also indicate that \methodshort can generalize a target contact relationship to a different environment. More visualizations can be found on our website $\href{https://implicit-contact-diffuser.github.io/}{https://implicit-contact-diffuser.github.io}$

* In submussion

Via

Access Paper or Ask Questions

SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement

Aug 01, 2024

Mark Boss, Zixuan Huang, Aaryaman Vasishta, Varun Jampani

Figure 1 for SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement

Figure 2 for SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement

Figure 3 for SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement

Figure 4 for SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement

Abstract:We present SF3D, a novel method for rapid and high-quality textured object mesh reconstruction from a single image in just 0.5 seconds. Unlike most existing approaches, SF3D is explicitly trained for mesh generation, incorporating a fast UV unwrapping technique that enables swift texture generation rather than relying on vertex colors. The method also learns to predict material parameters and normal maps to enhance the visual quality of the reconstructed 3D meshes. Furthermore, SF3D integrates a delighting step to effectively remove low-frequency illumination effects, ensuring that the reconstructed meshes can be easily used in novel illumination conditions. Experiments demonstrate the superior performance of SF3D over the existing techniques. Project page: https://stable-fast-3d.github.io

Via

Access Paper or Ask Questions

Multi-beam Training for Near-field Communications in High-frequency Bands

Jun 21, 2024

Cong Zhou, Changsheng You, Zixuan Huang, Shuo Shi, Yi Gong, Chan-Byoung Chae, Kaibin Huang

Figure 1 for Multi-beam Training for Near-field Communications in High-frequency Bands

Figure 2 for Multi-beam Training for Near-field Communications in High-frequency Bands

Figure 3 for Multi-beam Training for Near-field Communications in High-frequency Bands

Figure 4 for Multi-beam Training for Near-field Communications in High-frequency Bands

Abstract:In this paper, we study efficient multi-beam training design for near-field communications to reduce the beam training overhead of conventional single-beam training methods. In particular, the array-division based multi-beam training method, which is widely used in far-field communications, cannot be directly applied to the near-field scenario, since different sub-arrays may observe different user angles and there exist coverage holes in the angular domain. To address these issues, we first devise a new near-field multi-beam codebook by sparsely activating a portion of antennas to form a sparse linear array (SLA), hence generating multiple beams simultaneously by effective exploiting the near-field grating-lobs. Next, a two-stage near-field beam training method is proposed, for which several candidate user locations are identified firstly based on multi-beam sweeping over time, followed by the second stage to further determine the true user location with a small number of single-beam sweeping. Finally, numerical results show that our proposed multi-beam training method significantly reduces the beam training overhead of conventional single-beam training methods, yet achieving comparable rate performance in data transmission.

* In this paper, a novel near-field multi-beam training scheme is proposed by sparsely activating a portion of antennas to form a sparse linear array

Via

Access Paper or Ask Questions