Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chen Xie

Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment

May 26, 2025

Jeongsoo Choi, Zhikang Niu, Ji-Hoon Kim, Chunhui Wang, Joon Son Chung, Chen Xie

Abstract:The goal of this paper is to optimize the training process of diffusion-based text-to-speech models. While recent studies have achieved remarkable advancements, their training demands substantial time and computational costs, largely due to the implicit guidance of diffusion models in learning complex intermediate representations. To address this, we propose A-DMA, an effective strategy for Accelerating training with Dual Modality Alignment. Our method introduces a novel alignment pipeline leveraging both text and speech modalities: text-guided alignment, which incorporates contextual representations, and speech-guided alignment, which refines semantic representations. By aligning hidden states with discriminative features, our training scheme reduces the reliance on diffusion models for learning complex representations. Extensive experiments demonstrate that A-DMA doubles the convergence speed while achieving superior performance over baselines. Code and demo samples are available at: https://github.com/ZhikangNiu/A-DMA

* Interspeech 2025

Via

Access Paper or Ask Questions

EdgePoint2: Compact Descriptors for Superior Efficiency and Accuracy

Apr 24, 2025

Haodi Yao, Fenghua He, Ning Hao, Chen Xie

Abstract:The field of keypoint extraction, which is essential for vision applications like Structure from Motion (SfM) and Simultaneous Localization and Mapping (SLAM), has evolved from relying on handcrafted methods to leveraging deep learning techniques. While deep learning approaches have significantly improved performance, they often incur substantial computational costs, limiting their deployment in real-time edge applications. Efforts to create lightweight neural networks have seen some success, yet they often result in trade-offs between efficiency and accuracy. Additionally, the high-dimensional descriptors generated by these networks poses challenges for distributed applications requiring efficient communication and coordination, highlighting the need for compact yet competitively accurate descriptors. In this paper, we present EdgePoint2, a series of lightweight keypoint detection and description neural networks specifically tailored for edge computing applications on embedded system. The network architecture is optimized for efficiency without sacrificing accuracy. To train compact descriptors, we introduce a combination of Orthogonal Procrustes loss and similarity loss, which can serve as a general approach for hypersphere embedding distillation tasks. Additionally, we offer 14 sub-models to satisfy diverse application requirements. Our experiments demonstrate that EdgePoint2 consistently achieves state-of-the-art (SOTA) accuracy and efficiency across various challenging scenarios while employing lower-dimensional descriptors (32/48/64). Beyond its accuracy, EdgePoint2 offers significant advantages in flexibility, robustness, and versatility. Consequently, EdgePoint2 emerges as a highly competitive option for visual tasks, especially in contexts demanding adaptability to diverse computational and communication constraints.

Via

Access Paper or Ask Questions

Empowering AI to Generate Better AI Code: Guided Generation of Deep Learning Projects with LLMs

Apr 21, 2025

Chen Xie, Mingsheng Jiao, Xiaodong Gu, Beijun Shen

Abstract:While large language models (LLMs) have been widely applied to code generation, they struggle with generating entire deep learning projects, which are characterized by complex structures, longer functions, and stronger reliance on domain knowledge than general-purpose code. An open-domain LLM often lacks coherent contextual guidance and domain expertise for specific projects, making it challenging to produce complete code that fully meets user requirements. In this paper, we propose a novel planning-guided code generation method, DLCodeGen, tailored for generating deep learning projects. DLCodeGen predicts a structured solution plan, offering global guidance for LLMs to generate the project. The generated plan is then leveraged to retrieve semantically analogous code samples and subsequently abstract a code template. To effectively integrate these multiple retrieval-augmented techniques, a comparative learning mechanism is designed to generate the final code. We validate the effectiveness of our approach on a dataset we build for deep learning code generation. Experimental results demonstrate that DLCodeGen outperforms other baselines, achieving improvements of 9.7% in CodeBLEU and 3.6% in human evaluation metrics.

Via

Access Paper or Ask Questions

HiSTF Mamba: Hierarchical Spatiotemporal Fusion with Multi-Granular Body-Spatial Modeling for High-Fidelity Text-to-Motion Generation

Mar 10, 2025

Xingzu Zhan, Chen Xie, Haoran Sun, Xiaochun Mai

Abstract:Text-to-motion generation is a rapidly growing field at the nexus of multimodal learning and computer graphics, promising flexible and cost-effective applications in gaming, animation, robotics, and virtual reality. Existing approaches often rely on simple spatiotemporal stacking, which introduces feature redundancy, while subtle joint-level details remain overlooked from a spatial perspective. To this end, we propose a novel HiSTF Mamba framework. The framework is composed of three key modules: Dual-Spatial Mamba, Bi-Temporal Mamba, and Dynamic Spatiotemporal Fusion Module (DSFM). Dual-Spatial Mamba incorporates ``Part-based + Whole-based'' parallel modeling to represent both whole-body coordination and fine-grained joint dynamics. Bi-Temporal Mamba adopts a bidirectional scanning strategy, effectively encoding short-term motion details and long-term dependencies. DSFM further performs redundancy removal and extraction of complementary information for temporal features, then fuses them with spatial features, yielding an expressive spatio-temporal representation. Experimental results on the HumanML3D dataset demonstrate that HiSTF Mamba achieves state-of-the-art performance across multiple metrics. In particular, it reduces the FID score from 0.283 to 0.189, a relative decrease of nearly 30%. These findings validate the effectiveness of HiSTF Mamba in achieving high fidelity and strong semantic alignment in text-to-motion generation.

* 11pages,3figures,

Via

Access Paper or Ask Questions

GTransPDM: A Graph-embedded Transformer with Positional Decoupling for Pedestrian Crossing Intention Prediction

Sep 30, 2024

Chen Xie, Ciyun Lin, Xiaoyu Zheng, Bowen Gong, Dayong Wu, Antonio M. López

Figure 1 for GTransPDM: A Graph-embedded Transformer with Positional Decoupling for Pedestrian Crossing Intention Prediction

Figure 2 for GTransPDM: A Graph-embedded Transformer with Positional Decoupling for Pedestrian Crossing Intention Prediction

Figure 3 for GTransPDM: A Graph-embedded Transformer with Positional Decoupling for Pedestrian Crossing Intention Prediction

Figure 4 for GTransPDM: A Graph-embedded Transformer with Positional Decoupling for Pedestrian Crossing Intention Prediction

Abstract:Understanding and predicting pedestrian crossing behavioral intention is crucial for autonomous vehicles driving safety. Nonetheless, challenges emerge when using promising images or environmental context masks to extract various factors for time-series network modeling, causing pre-processing errors or a loss in efficiency. Typically, pedestrian positions captured by onboard cameras are often distorted and do not accurately reflect their actual movements. To address these issues, GTransPDM -- a Graph-embedded Transformer with a Position Decoupling Module -- was developed for pedestrian crossing intention prediction by leveraging multi-modal features. First, a positional decoupling module was proposed to decompose the pedestrian lateral movement and simulate depth variations in the image view. Then, a graph-embedded Transformer was designed to capture the spatial-temporal dynamics of human pose skeletons, integrating essential factors such as position, skeleton, and ego-vehicle motion. Experimental results indicate that the proposed method achieves 92% accuracy on the PIE dataset and 87% accuracy on the JAAD dataset, with a processing speed of 0.05ms. It outperforms the state-of-the-art in comparison.

Via

Access Paper or Ask Questions

Language-Driven Interactive Traffic Trajectory Generation

May 24, 2024

Junkai Xia, Chenxin Xu, Qingyao Xu, Chen Xie, Yanfeng Wang, Siheng Chen

Figure 1 for Language-Driven Interactive Traffic Trajectory Generation

Figure 2 for Language-Driven Interactive Traffic Trajectory Generation

Figure 3 for Language-Driven Interactive Traffic Trajectory Generation

Figure 4 for Language-Driven Interactive Traffic Trajectory Generation

Abstract:Realistic trajectory generation with natural language control is pivotal for advancing autonomous vehicle technology. However, previous methods focus on individual traffic participant trajectory generation, thus failing to account for the complexity of interactive traffic dynamics. In this work, we propose InteractTraj, the first language-driven traffic trajectory generator that can generate interactive traffic trajectories. InteractTraj interprets abstract trajectory descriptions into concrete formatted interaction-aware numerical codes and learns a mapping between these formatted codes and the final interactive trajectories. To interpret language descriptions, we propose a language-to-code encoder with a novel interaction-aware encoding strategy. To produce interactive traffic trajectories, we propose a code-to-trajectory decoder with interaction-aware feature aggregation that synergizes vehicle interactions with the environmental map and the vehicle moves. Extensive experiments show our method demonstrates superior performance over previous SoTA methods, offering a more realistic generation of interactive traffic trajectories with high controllability via diverse natural language commands. Our code is available at https://github.com/X1a-jk/InteractTraj.git

Via

Access Paper or Ask Questions

HW-SW Optimization of DNNs for Privacy-preserving People Counting on Low-resolution Infrared Arrays

Feb 02, 2024

Matteo Risso, Chen Xie, Francesco Daghero, Alessio Burrello, Seyedmorteza Mollaei, Marco Castellano, Enrico Macii, Massimo Poncino, Daniele Jahier Pagliari

Figure 1 for HW-SW Optimization of DNNs for Privacy-preserving People Counting on Low-resolution Infrared Arrays

Figure 2 for HW-SW Optimization of DNNs for Privacy-preserving People Counting on Low-resolution Infrared Arrays

Figure 3 for HW-SW Optimization of DNNs for Privacy-preserving People Counting on Low-resolution Infrared Arrays

Figure 4 for HW-SW Optimization of DNNs for Privacy-preserving People Counting on Low-resolution Infrared Arrays

Abstract:Low-resolution infrared (IR) array sensors enable people counting applications such as monitoring the occupancy of spaces and people flows while preserving privacy and minimizing energy consumption. Deep Neural Networks (DNNs) have been shown to be well-suited to process these sensor data in an accurate and efficient manner. Nevertheless, the space of DNNs' architectures is huge and its manual exploration is burdensome and often leads to sub-optimal solutions. To overcome this problem, in this work, we propose a highly automated full-stack optimization flow for DNNs that goes from neural architecture search, mixed-precision quantization, and post-processing, down to the realization of a new smart sensor prototype, including a Microcontroller with a customized instruction set. Integrating these cross-layer optimizations, we obtain a large set of Pareto-optimal solutions in the 3D-space of energy, memory, and accuracy. Deploying such solutions on our hardware platform, we improve the state-of-the-art achieving up to 4.2x model size reduction, 23.8x code size reduction, and 15.38x energy reduction at iso-accuracy.

* This paper has been accepted for publication in the DATE 2024 conference IEEE

Via

Access Paper or Ask Questions

Efficient Deep Learning Models for Privacy-preserving People Counting on Low-resolution Infrared Arrays

Apr 12, 2023

Chen Xie, Francesco Daghero, Yukai Chen, Marco Castellano, Luca Gandolfi, Andrea Calimera, Enrico Macii, Massimo Poncino, Daniele Jahier Pagliari

Abstract:Ultra-low-resolution Infrared (IR) array sensors offer a low-cost, energy-efficient, and privacy-preserving solution for people counting, with applications such as occupancy monitoring. Previous work has shown that Deep Learning (DL) can yield superior performance on this task. However, the literature was missing an extensive comparative analysis of various efficient DL architectures for IR array-based people counting, that considers not only their accuracy, but also the cost of deploying them on memory- and energy-constrained Internet of Things (IoT) edge nodes. In this work, we address this need by comparing 6 different DL architectures on a novel dataset composed of IR images collected from a commercial 8x8 array, which we made openly available. With a wide architectural exploration of each model type, we obtain a rich set of Pareto-optimal solutions, spanning cross-validated balanced accuracy scores in the 55.70-82.70% range. When deployed on a commercial Microcontroller (MCU) by STMicroelectronics, the STM32L4A6ZG, these models occupy 0.41-9.28kB of memory, and require 1.10-7.74ms per inference, while consuming 17.18-120.43 $\mu$J of energy. Our models are significantly more accurate than a previous deterministic method (up to +39.9%), while being up to 3.53x faster and more energy efficient. Further, our models' accuracy is comparable to state-of-the-art DL solutions on similar resolution sensors, despite a much lower complexity. All our models enable continuous, real-time inference on a MCU-based IoT node, with years of autonomous operation without battery recharging.

* This article has been accepted for publication in IEEE Internet of Things Journal

Via

Access Paper or Ask Questions

Human Activity Recognition on Microcontrollers with Quantized and Adaptive Deep Neural Networks

Sep 02, 2022

Francesco Daghero, Alessio Burrello, Chen Xie, Marco Castellano, Luca Gandolfi, Andrea Calimera, Enrico Macii, Massimo Poncino, Daniele Jahier Pagliari

Figure 1 for Human Activity Recognition on Microcontrollers with Quantized and Adaptive Deep Neural Networks

Figure 2 for Human Activity Recognition on Microcontrollers with Quantized and Adaptive Deep Neural Networks

Figure 3 for Human Activity Recognition on Microcontrollers with Quantized and Adaptive Deep Neural Networks

Figure 4 for Human Activity Recognition on Microcontrollers with Quantized and Adaptive Deep Neural Networks

Abstract:Human Activity Recognition (HAR) based on inertial data is an increasingly diffused task on embedded devices, from smartphones to ultra low-power sensors. Due to the high computational complexity of deep learning models, most embedded HAR systems are based on simple and not-so-accurate classic machine learning algorithms. This work bridges the gap between on-device HAR and deep learning, proposing a set of efficient one-dimensional Convolutional Neural Networks (CNNs) deployable on general purpose microcontrollers (MCUs). Our CNNs are obtained combining hyper-parameters optimization with sub-byte and mixed-precision quantization, to find good trade-offs between classification results and memory occupation. Moreover, we also leverage adaptive inference as an orthogonal optimization to tune the inference complexity at runtime based on the processed input, hence producing a more flexible HAR system. With experiments on four datasets, and targeting an ultra-low-power RISC-V MCU, we show that (i) We are able to obtain a rich set of Pareto-optimal CNNs for HAR, spanning more than 1 order of magnitude in terms of memory, latency and energy consumption; (ii) Thanks to adaptive inference, we can derive >20 runtime operating modes starting from a single CNN, differing by up to 10% in classification scores and by more than 3x in inference complexity, with a limited memory overhead; (iii) on three of the four benchmarks, we outperform all previous deep learning methods, reducing the memory occupation by more than 100x. The few methods that obtain better performance (both shallow and deep) are not compatible with MCU deployment. (iv) All our CNNs are compatible with real-time on-device HAR with an inference latency <16ms. Their memory occupation varies in 0.05-23.17 kB, and their energy consumption in 0.005 and 61.59 uJ, allowing years of continuous operation on a small battery supply.

* ACM Transactions on Embededded Computing Systems, Vol 21, Issue 4, Article 46 (July 2022), 28 pages

Via

Access Paper or Ask Questions

Adaptive Random Forests for Energy-Efficient Inference on Microcontrollers

May 27, 2022

Francesco Daghero, Alessio Burrello, Chen Xie, Luca Benini, Andrea Calimera, Enrico Macii, Massimo Poncino, Daniele Jahier Pagliari

Figure 1 for Adaptive Random Forests for Energy-Efficient Inference on Microcontrollers

Figure 2 for Adaptive Random Forests for Energy-Efficient Inference on Microcontrollers

Figure 3 for Adaptive Random Forests for Energy-Efficient Inference on Microcontrollers

Figure 4 for Adaptive Random Forests for Energy-Efficient Inference on Microcontrollers

Abstract:Random Forests (RFs) are widely used Machine Learning models in low-power embedded devices, due to their hardware friendly operation and high accuracy on practically relevant tasks. The accuracy of a RF often increases with the number of internal weak learners (decision trees), but at the cost of a proportional increase in inference latency and energy consumption. Such costs can be mitigated considering that, in most applications, inputs are not all equally difficult to classify. Therefore, a large RF is often necessary only for (few) hard inputs, and wasteful for easier ones. In this work, we propose an early-stopping mechanism for RFs, which terminates the inference as soon as a high-enough classification confidence is reached, reducing the number of weak learners executed for easy inputs. The early-stopping confidence threshold can be controlled at runtime, in order to favor either energy saving or accuracy. We apply our method to three different embedded classification tasks, on a single-core RISC-V microcontroller, achieving an energy reduction from 38% to more than 90% with a drop of less than 0.5% in accuracy. We also show that our approach outperforms previous adaptive ML methods for RFs.

* 2021 IFIP/IEEE 29th International Conference on Very Large Scale Integration (VLSI-SoC), 2021, pp. 1-6
* Published in: 2021 IFIP/IEEE 29th International Conference on Very Large Scale Integration (VLSI-SoC), 2021

Via

Access Paper or Ask Questions