Zixin Zhang

HKUST

Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

Jun 10, 2025

Ailin Huang, Bingxin Li, Bruce Wang, Boyong Wu, Chao Yan, Chengli Feng, Heng Wang, Hongyu Zhou, Hongyuan Wang, Jingbei Li(+66 more)

Abstract: Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM, and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved token output of text and audio to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merging to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming state-of-the-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of the token-based vocoder in enhancing overall performance for AQAA tasks.

* 12 pages, 3 figures

ComfyMind: Toward General-Purpose Generation via Tree-Based Planning and Reactive Feedback

May 23, 2025

Litao Guo, Xinli Xu, Luozhou Wang, Jiantao Lin, Jinsong Zhou, Zixin Zhang, Bolan Su, Ying-Cong Chen

Abstract: With the rapid advancement of generative models, general-purpose generation has gained increasing attention as a promising approach to unify diverse tasks across modalities within a single system. Despite this progress, existing open-source frameworks often remain fragile and struggle to support complex real-world applications due to the lack of structured workflow planning and execution-level feedback. To address these limitations, we present ComfyMind, a collaborative AI system designed to enable robust and scalable general-purpose generation, built on the ComfyUI platform. ComfyMind introduces two core innovations: a Semantic Workflow Interface (SWI) that abstracts low-level node graphs into callable functional modules described in natural language, enabling high-level composition and reducing structural errors; and a Search Tree Planning mechanism with localized feedback execution, which models generation as a hierarchical decision process and allows adaptive correction at each stage. Together, these components improve the stability and flexibility of complex generative workflows. We evaluate ComfyMind on three public benchmarks: ComfyBench, GenEval, and Reason-Edit, which span generation, editing, and reasoning tasks. Results show that ComfyMind consistently outperforms existing open-source baselines and achieves performance comparable to GPT-Image-1. ComfyMind paves a promising path for the development of open-source general-purpose generative AI systems. Project page: https://github.com/LitaoGuo/ComfyMind

Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

Feb 18, 2025

Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen(+135 more)

Abstract: Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.

Comateformer: Combined Attention Transformer for Semantic Sentence Matching

Dec 10, 2024

Bo Li, Di Liang, Zixin Zhang

Abstract: Transformer-based models have made significant strides in semantic matching tasks by capturing connections between phrase pairs. However, to assess the relevance of sentence pairs, it is insufficient to examine only the general similarity between the sentences. It is crucial to also consider the subtle details that differentiate them from each other. Regrettably, attention softmax operations in transformers tend to miss these subtle differences. To this end, we propose a novel semantic sentence matching model named Combined Attention Network based on Transformer model (Comateformer). In the Comateformer model, we design a novel transformer-based quasi-attention mechanism with compositional properties. Unlike traditional attention mechanisms that merely adjust the weights of input tokens, our proposed method learns how to combine, subtract, or resize specific vectors when building a representation. Moreover, our proposed approach builds on the intuition of similarity and dissimilarity (negative affinity) when calculating dual affinity scores. This allows for a more meaningful representation of relationships between sentences. To evaluate the performance of our proposed model, we conducted extensive experiments on ten public real-world datasets as well as robustness testing. Experimental results show that our method achieves consistent improvements.

* Accepted at the 27th European Conference on Artificial Intelligence (ECAI 2024)

Robots with Attitude: Singularity-Free Quaternion-Based Model-Predictive Control for Agile Legged Robots

Sep 17, 2024

Zixin Zhang, John Z. Zhang, Shuo Yang, Zachary Manchester

Abstract: We present a model-predictive control (MPC) framework for legged robots that avoids the singularities associated with common three-parameter attitude representations like Euler angles during large-angle rotations. Our method parameterizes the robot's attitude with singularity-free unit quaternions and makes modifications to the iterative linear-quadratic regulator (iLQR) algorithm to deal with the resulting geometry. The derivation of our algorithm requires only elementary calculus and linear algebra, deliberately avoiding the abstraction and notation of Lie groups. We demonstrate the performance and computational efficiency of quaternion MPC in several experiments on quadruped and humanoid robots.
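
As a rough illustration of the unit-quaternion attitude representation mentioned in this abstract, the sketch below composes rotations with the Hamilton product and computes a three-parameter attitude error from the vector part of the relative quaternion. It is not the authors' algorithm: the function names, the (w, x, y, z) ordering, and the choice of error map are assumptions made only for this example.

```python
import math

def quat_multiply(q, p):
    """Hamilton product of two quaternions stored as (w, x, y, z)."""
    qw, qx, qy, qz = q
    pw, px, py, pz = p
    return (
        qw * pw - qx * px - qy * py - qz * pz,
        qw * px + qx * pw + qy * pz - qz * py,
        qw * py - qx * pz + qy * pw + qz * px,
        qw * pz + qx * py - qy * px + qz * pw,
    )

def quat_conjugate(q):
    """Conjugate; for a unit quaternion this is also its inverse."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def quat_from_axis_angle(axis, angle):
    """Unit quaternion for a rotation of `angle` radians about `axis`."""
    ax, ay, az = axis
    norm = math.sqrt(ax * ax + ay * ay + az * az)
    s = math.sin(angle / 2.0) / norm
    return (math.cos(angle / 2.0), ax * s, ay * s, az * s)

def attitude_error(q, q_ref):
    """Hypothetical three-parameter error: vector part of conj(q_ref) * q.

    The sign is flipped when the scalar part is negative, because q and -q
    describe the same rotation.
    """
    dq = quat_multiply(quat_conjugate(q_ref), q)
    if dq[0] < 0.0:
        dq = tuple(-c for c in dq)
    return dq[1:]

# A 90-degree pitch is a gimbal-lock configuration for common Euler-angle
# conventions, but the quaternion and its error remain well defined.
identity = (1.0, 0.0, 0.0, 0.0)
pitch_90 = quat_from_axis_angle((0.0, 1.0, 0.0), math.pi / 2.0)
error = attitude_error(pitch_90, identity)
```

The error vector is zero when the two attitudes coincide and grows with the rotation angle between them, which is the kind of quantity a quadratic tracking cost could penalize.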

PlankAssembly: Robust 3D Reconstruction from Three Orthographic Views with Learnt Shape Programs

Aug 10, 2023

Wentao Hu, Jia Zheng, Zixin Zhang, Xiaojun Yuan, Jian Yin, Zihan Zhou

Abstract: In this paper, we develop a new method to automatically convert 2D line drawings from three orthographic views into 3D CAD models. Existing methods for this problem reconstruct 3D models by back-projecting the 2D observations into 3D space while maintaining explicit correspondence between the input and output. Such methods are sensitive to errors and noise in the input, and thus often fail in practice, where the input drawings created by human designers are imperfect. To overcome this difficulty, we leverage the attention mechanism in a Transformer-based sequence generation model to learn flexible mappings between the input and output. Further, we design shape programs that are suitable for generating the objects of interest to boost the reconstruction accuracy and facilitate CAD modeling applications. Experiments on a new benchmark dataset show that our method significantly outperforms existing ones when the inputs are noisy or incomplete.

* To Appear in ICCV 2023. The first three authors contributed equally to this work. The project page is at https://manycore-research.github.io/PlankAssembly

Cerberus: Low-Drift Visual-Inertial-Leg Odometry For Agile Locomotion

Sep 16, 2022

Shuo Yang, Zixin Zhang, Zhengyu Fu, Zachary Manchester

Abstract: We present an open-source Visual-Inertial-Leg Odometry (VILO) state estimation solution, Cerberus, for legged robots that estimates position precisely on various terrains in real time using a set of standard sensors, including stereo cameras, IMU, joint encoders, and contact sensors. In addition to estimating robot states, we also perform online kinematic parameter calibration and contact outlier rejection to substantially reduce position drift. Hardware experiments in various indoor and outdoor environments validate that calibrating kinematic parameters within Cerberus can reduce estimation drift to lower than 1% during long-distance, high-speed locomotion. Our drift results are better than those of any other state estimation method using the same set of sensors reported in the literature. Moreover, our state estimator performs well even when the robot is experiencing large impacts and camera occlusion. The implementation of the state estimator, along with the datasets used to compute our results, is available at https://github.com/ShuoYangRobotics/Cerberus.

* 7 pages, 6 figures, submitted to IEEE ICRA 2023

Design of an Optoelectronically Innervated Gripper for Rigid-Soft Interactive Grasping

Dec 06, 2020

Linhan Yang, Xudong Han, Weijie Guo, Zixin Zhang, Fang Wan, Jia Pan, Chaoyang Song

Abstract: Over the past few decades, efforts have been made towards robust robotic grasping, and therefore dexterous manipulation. Soft grippers have shown their potential in robust grasping due to their inherent properties: low control complexity and high adaptability. However, the deformation of the soft gripper when interacting with objects introduces inaccuracy for the grasped objects, which causes instability for robust grasping and further manipulation. In this paper, we present an omni-directional adaptive soft finger that can sense deformation based on embedded optical fibers and the application of machine learning methods to interpret transmitted light intensities. Furthermore, to use the tactile information provided by the soft finger, we design a low-cost, multi-degree-of-freedom gripper that actively conforms to the shape of objects and optimizes the grasping policy, which we call Rigid-Soft Interactive Grasping. This grasping policy has two main advantages: first, more robust grasping can be achieved through active adaptation; second, the tactile information collected can be helpful for further manipulation.

* 11 pages, 6 figures, submitted to IEEE ICRA 2021
