Jiahao Zhou

MiDashengLM: Efficient Audio Understanding with General Audio Captions

Aug 06, 2025

Heinrich Dinkel, Gang Li, Jizhong Liu, Jian Luan, Yadong Niu, Xingwei Sun, Tianzi Wang, Qiyang Xiao, Junbo Zhang, Jiahao Zhou

Abstract: Current approaches for large audio language models (LALMs) often rely on closed data sources or proprietary models, limiting their generalization and accessibility. This paper introduces MiDashengLM, a novel open audio-language model designed for efficient and comprehensive audio understanding by training on general audio captions from our novel ACAVCaps dataset. MiDashengLM relies exclusively on publicly available pretraining and supervised fine-tuning (SFT) datasets, ensuring full transparency and reproducibility. At its core, MiDashengLM integrates Dasheng, an open-source audio encoder specifically engineered to process diverse auditory information effectively. Unlike previous works, which primarily focus on Automatic Speech Recognition (ASR) based audio-text alignment, our strategy centers on general audio captions, fusing speech, sound, and music information into a single holistic textual representation of complex audio scenes. Lastly, MiDashengLM provides up to a 4x speedup in time-to-first-token (TTFT) and up to 20x higher throughput than comparable models. Checkpoints are available online at https://huggingface.co/mispeech/midashenglm-7b and https://github.com/xiaomi-research/dasheng-lm.

VEnvision3D: A Synthetic Perception Dataset for 3D Multi-Task Model Research

Mar 05, 2024

Jiahao Zhou, Chen Long, Yue Xie, Jialiang Wang, Boheng Li, Haiping Wang, Zhe Chen, Zhen Dong

Abstract: Developing a unified multi-task foundation model has become a critical challenge in computer vision research. In 3D computer vision, most current datasets focus on only a single task, which complicates the concurrent training of various downstream tasks. In this paper, we introduce VEnvision3D, a large synthetic 3D perception dataset for multi-task learning, covering depth completion, segmentation, upsampling, place recognition, and 3D reconstruction. Since the data for each task is collected in the same environmental domain, the sub-tasks are inherently aligned in terms of the data they use. This attribute can therefore help explore the potential of multi-task models, and even foundation models, without separate training methods. Meanwhile, because virtual environments are freely editable, we implement some novel settings, such as simulating temporal changes in the environment and sampling point clouds on model surfaces. These characteristics enable us to present several new benchmarks. We also perform extensive studies on multi-task end-to-end models, revealing new observations, challenges, and opportunities for future research. Our dataset and code will be open-sourced upon acceptance.

Optimized Design Method for Satellite Constellation Configuration Based on Real-time Coverage Area Evaluation

Sep 16, 2022

Jiahao Zhou, Boheng Li, Qingxiang Meng

Abstract: When a satellite constellation cooperatively images a large area for reconnaissance, the coverage requirements must be met with minimal consumption of observation resources to obtain an optimal constellation observation scheme. Taking the minimum number of satellites and the real-time ground coverage requirements as the optimization objectives, this paper proposes an optimized design of the satellite constellation configuration for full-coverage imaging of large regions, using an improved simulated annealing algorithm combined with a real-time coverage evaluation method based on hexagonal discretization. The algorithm adapts to experimental conditions, is efficient, and can meet industrial accuracy requirements. Its effectiveness and adaptability are tested in simulation applications.

* the 29th International Conference on Geoinformatics, EI
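
As a rough illustration of the idea in this abstract, the sketch below pairs plain simulated annealing with a coverage check on a hexagonally discretized region. It is not the paper's improved algorithm: the axial hex coordinates, the fixed circular footprint, the penalty weight, the cooling schedule, and every function name here are assumptions made for the example, and real orbital geometry is ignored entirely.

```python
import math
import random


def hex_distance(a, b):
    """Grid distance between two hex cells given in axial coordinates (q, r)."""
    dq, dr = a[0] - b[0], a[1] - b[1]
    return (abs(dq) + abs(dr) + abs(dq + dr)) // 2


def hex_region(radius):
    """All cells within `radius` steps of the origin: the discretized target area."""
    return [(q, r)
            for q in range(-radius, radius + 1)
            for r in range(-radius, radius + 1)
            if hex_distance((q, r), (0, 0)) <= radius]


def uncovered(cells, centers, footprint):
    """Count target cells farther than `footprint` steps from every chosen center."""
    return sum(1 for c in cells
               if all(hex_distance(c, s) > footprint for s in centers))


def anneal(cells, candidates, footprint, steps=2000, t0=2.0, alpha=0.995,
           penalty=10, seed=0):
    """Search for a small set of footprint centers that still covers every cell.

    Assumes that selecting all candidates covers the region, so the starting
    state is feasible and the returned set always gives full coverage.
    """
    rng = random.Random(seed)
    chosen = set(candidates)            # start with every candidate active
    cur = len(chosen) + penalty * uncovered(cells, chosen, footprint)
    best = set(chosen)                  # best fully covering selection so far
    t = t0
    for _ in range(steps):
        trial = chosen ^ {rng.choice(candidates)}   # toggle one center
        gaps = uncovered(cells, trial, footprint)
        new = len(trial) + penalty * gaps
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if new <= cur or rng.random() < math.exp((cur - new) / t):
            chosen, cur = trial, new
            if gaps == 0 and len(chosen) < len(best):
                best = set(chosen)
        t *= alpha                      # geometric cooling
    return best


if __name__ == "__main__":
    cells = hex_region(2)
    centers = anneal(cells, cells, footprint=1)
    print(len(cells), "cells covered by", len(centers), "footprints")
```

Tracking the best feasible selection separately from the current state lets the search wander through states with coverage gaps while still returning a configuration that covers the whole region.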

Joint Matrix Decomposition for Deep Convolutional Neural Networks Compression

Jul 12, 2021

Shaowu Chen, Jiahao Zhou, Weize Sun, Lei Huang

Abstract: Deep convolutional neural networks (CNNs) with a large number of parameters require huge computational resources, which has limited the application of CNNs on resource-constrained devices. Decomposition-based methods have therefore been used to compress CNNs in recent years. However, since the compression factor and performance are negatively correlated, state-of-the-art works either suffer from severe performance degradation or achieve only low compression factors. To overcome these problems, unlike previous works that compress layers separately, we propose to compress CNNs and alleviate performance degradation via joint matrix decomposition. The idea is inspired by the fact that CNNs contain many repeated modules; by projecting weights with the same structure into the same subspace, networks can be further compressed and even accelerated. In particular, three joint matrix decomposition schemes are developed, and the corresponding optimization approaches based on Singular Value Decomposition are proposed. Extensive experiments are conducted across three challenging compact CNNs and three benchmark data sets to demonstrate the superior performance of our proposed algorithms. As a result, our methods can compress the size of ResNet-34 by 22x with less accuracy degradation than several state-of-the-art methods.

* Code is publicly available on GitHub: https://github.com/ShaowuChen/JointSVD
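
To make the notion of projecting same-shaped weights into a shared subspace concrete, here is a generic worked form of a shared-factor truncated SVD. The notation (N weight matrices W_i of size m x n, rank r, shared factor U, per-layer factors M_i) is assumed for illustration only, and this is not claimed to match any of the three schemes developed in the paper.

```latex
% Illustrative notation (assumed, not taken from the paper):
% W_1, ..., W_N are m x n weight matrices of N repeated modules, r is the target rank.
\begin{aligned}
\text{separate:}\quad & W_i \approx A_i B_i,
  && A_i \in \mathbb{R}^{m \times r},\; B_i \in \mathbb{R}^{r \times n},
  && \text{parameters: } N\,r\,(m + n) \\
\text{joint:}\quad & \bigl[\, W_1 \;\cdots\; W_N \,\bigr] \approx U \bigl[\, M_1 \;\cdots\; M_N \,\bigr],
  && U \in \mathbb{R}^{m \times r},\; M_i \in \mathbb{R}^{r \times n},
  && \text{parameters: } r\,m + N\,r\,n
\end{aligned}
```

Taking U as the top r left singular vectors of the concatenated matrix, with M_i = U^T W_i, minimizes the summed squared Frobenius error over all W_i for a shared column space (Eckart-Young). At equal rank, the joint form saves (N - 1) r m parameters, at the cost of a somewhat larger approximation error per layer.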
